[erlang-questions] auto-syncing mnesia after a network split

Rick Pettit rpettit@REDACTED
Tue Dec 2 22:09:27 CET 2008


On Tue, December 2, 2008 2:40 pm, Joel Reymont wrote:
> Rick,
>
> On Dec 2, 2008, at 8:30 PM, Rick Pettit wrote:
>
>> (e.g. how can a bank ATM allow
>> me to withdraw funds if it cannot reach its peer node(s) at my bank to
>> determine the availability of such funds?).
>
> In my scenario a bank ATM would have an internal Mnesia table with the
> balance :-). The ATM would clearly be part of a cluster of ATMs,
> replicating their transactions and balances to all other ATMs in the
> cluster.

Suppose there were two ATMs in my neighborhood, each with a table
containing bank account balances, mine included.

Now suppose the first ATM became disconnected from its cluster, which
included the second ATM.  Suppose I visit the first ATM and withdraw all
the funds from my account. The first ATM records this in a transaction log
but cannot replicate the transaction due to the network partition.

So I make my way down the street to the second ATM, which also has a
record of my balance _before_ I made the withdrawal at the first ATM. So I
again withdraw all my funds.

I am in this particular case quite happy, though I suspect my bank might
not be when they work to mend the network partition. And no, they can't
have their, er, my money back :-)
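
To make the scenario above concrete, here is a minimal sketch in Python (a
toy model, not Mnesia code). The Atm class, its peers list, and the account
name are all made up for illustration; the only point is that each replica
checks its own local copy of the balance, so during a split both checks pass.

```python
class Atm:
    """One ATM holding a full local copy of the balance table (toy model)."""

    def __init__(self, name, balances):
        self.name = name
        self.balances = dict(balances)  # this ATM's local replica
        self.peers = []                 # ATMs currently reachable
        self.log = []                   # withdrawals not yet replicated

    def withdraw(self, account, amount):
        # The check only sees the local replica.
        if self.balances[account] < amount:
            return False
        self.balances[account] -= amount
        if self.peers:
            # Connected: push the change to every reachable replica.
            for peer in self.peers:
                peer.balances[account] -= amount
        else:
            # Partitioned: all we can do is remember it for later.
            self.log.append((account, amount))
        return True


initial = {"rick": 100}
atm1, atm2 = Atm("atm1", initial), Atm("atm2", initial)

# Network split: neither ATM can reach the other, so peers stay empty.
first = atm1.withdraw("rick", 100)
second = atm2.withdraw("rick", 100)
total_paid_out = 100 * (first + second)
```

Both withdrawals succeed, so the two ATMs pay out twice the real balance.
With the peers lists filled in (no partition), the second withdrawal would
be refused because the first one had already been replicated.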

>> Most systems I work with implement a recovery procedure similar to what
>> Ulf has posted in the past on this list.
>
> Would you kindly post a link to that procedure in this thread, for
> easier reference?

No problem:

  http://www.erlang.org/pipermail/erlang-questions/2001-January/002484.html
  http://www.erlang.org/pipermail/erlang-questions/2006-February/019092.html
  http://www.erlang.org/pipermail/erlang-questions/2007-January/024716.html

>> Because the systems I am referring to require high-availability over 100%
>> data consistency, this is perfectly ok (and works quite well). With issues
>> like telecom "glare" I couldn't be 100% accurate all the time anyway.
>
> What's telecom glare?

A quick Google search turned up the following:

glare telecom definition

The condition that arises when a telephone line or trunk is seized at both
ends for different reasons, perhaps causing the collision between an
incoming call and an outgoing call, for example. Glare is a phenomenon
associated with loop start signaling used to support single-line
telephones, multi-line telephones, and key telephone systems (KTSs). When
the handset of the telephone is lifted, the electrical loop is completed
and current flows across the circuit. The central office switch detects
that fact and returns dial tone for an outgoing call, or connects an
incoming call, as appropriate. If the user picks up the handset to place
an outgoing call at the same time that the central office switch is
attempting to connect an incoming call, a collision, or glare condition,
occurs. See also ground start, loop, and loop start.

===

This affects my software in that there is the chance of sending a call out
a trunk (during periods of peak call volume) which may actually be in use
by the time the call lands there. The system has a means of detecting and
handling this outside of any mnesia code (call retries, etc).
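
As a rough illustration of "detect and retry", here is a small Python sketch.
It is not how my system is written; place_call, the seize callback, and the
trunk numbers are hypothetical names chosen for the example. The idea is just
that the table of idle trunks is treated as a hint, and a failed seizure
moves on to the next candidate.

```python
def place_call(trunks, seize, max_attempts=3):
    """Try idle-looking trunks in order, retrying on glare.

    `trunks` lists the trunk ids our (possibly stale) view shows as idle.
    `seize(trunk)` is a hypothetical callback that returns True when the
    seizure succeeds and False when the far end got there first (glare).
    Returns the trunk that carried the call, or None if all attempts failed.
    """
    for trunk in trunks[:max_attempts]:
        if seize(trunk):
            return trunk
    return None


# Trunk 1 was grabbed by the far end after our view was last updated.
actually_busy = {1}
chosen = place_call([1, 2, 3], lambda trunk: trunk not in actually_busy)
```

Here the first attempt hits glare on trunk 1 and the call goes out on trunk 2.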

>> So, to recover from a partition it is enough to pick any functioning node
>> as the new "master" and have others restart and/or force load tables from
>> it. The entire time clients keep pushing new stats into the system, so
>> everything "converges on reality" in the end following a recovery attempt
>> anyway.
>
>
> I understand that Mnesia was designed for telco ops but I want to run
> my social network on top of it. I did a search before and all the
> solutions were along the lines of "I'm dealing with telco stuff or I
> can just throw that data out". I don't have such luxury and don't want
> to throw Mnesia out in favor of PostgreSQL until I absolutely have to.

Agreed.

This is a really tough problem to solve for sure. If I did not have the
luxury of potentially losing transactions, my solution would not work.
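
To show what is lost, here is a toy Python model of the "pick a master and
reload from it" recovery quoted above. It is only a sketch: the recover
function, node names, and table contents are invented for illustration, and
it does not call Mnesia (there the usual building blocks would be things
like mnesia:set_master_nodes and a restart of the other nodes).

```python
import copy


def recover(replicas, master):
    """Overwrite every replica's table with a copy of the master's table."""
    for name in replicas:
        if name != master:
            replicas[name] = copy.deepcopy(replicas[master])
    return replicas


# Two replicas of a stats table that diverged during a partition.
replicas = {
    "node_a": {"trunk_1": "idle", "trunk_2": "busy"},
    "node_b": {"trunk_1": "busy", "trunk_2": "busy", "trunk_3": "idle"},
}

recover(replicas, master="node_a")
# node_b's partition-time writes (trunk_1 busy, the trunk_3 row) are gone.
lost_row = "trunk_3" not in replicas["node_b"]

# Clients keep reporting, so a fresh write now reaches every replica.
for table in replicas.values():
    table["trunk_3"] = "idle"
```

Whatever node_b recorded during the split is simply discarded. That is fine
for stats that get refreshed constantly, and it is exactly why this approach
does not fit data that must never be dropped.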

I'll continue to follow the thread and add anything I think might be
useful if it comes to me.

-Rick



