[rabbitmq-discuss] HA behavior during a network split

Wed Jul 11 10:24:54 BST 2012

Hi Elias,

As I'm fairly new around here, I'll try and share what I've learned so 
far and allow the more experienced folks to chip in and fill in the 
details (or correct me if I go astray).

On 07/11/2012 12:11 AM, Elias Levy wrote:
> I am curious as to what the behavior of HA queues during a network 
> split is.
>
> The documentation states that when a master fails a slave will be 
> promoted to master, but it's silent under what conditions a slave will 
> consider a master to have failed.  Is there some timeout after which 
> slaves will consider a master to have failed?  If so, what is the time 
> value?
>

This situation is not handled using a timeout. HA queues are based on a 
technology called Guaranteed Multicast (aka GM), which, as per the 
documentation, was developed independently by and for RabbitMQ. GM 
provides an atomic broadcast capability similar to the work described by 
Levy et al (biblion.epfl.ch/EPFL/theses/2008/3999/EPFL_TH3999.pdf).

You can take a look at the GM source code here: 
http://hg.rabbitmq.com/rabbitmq-server/file/default/src/gm.erl

A GM group forms a ring, in which members are connected to their 
immediate neighbours (in both directions) only. If this connection 
breaks then the death of the member is propagated around the ring and 
everything 'reshuffles' to compensate for this. The deaths are noticed 
because the Erlang processes involved are monitored (see the links under 
[monitors] at the bottom for technical details) and the guarantees and 
relative timings involved can be understood in that context.
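
To make the 'reshuffle' a little more concrete, here is a toy model in 
Python. It is only a sketch of the idea described above, not how gm.erl 
is written: the Ring class and its method names are my own invention, 
and real GM detects deaths through Erlang monitors rather than an 
explicit call.

```python
class Ring:
    """Toy model of a GM-style ring: each member only knows its two neighbours."""

    def __init__(self, members):
        # The list order defines the ring; the last member wraps to the first.
        self.members = list(members)

    def neighbours(self, member):
        """Return the (left, right) neighbours of a live member."""
        i = self.members.index(member)
        n = len(self.members)
        return self.members[(i - 1) % n], self.members[(i + 1) % n]

    def member_died(self, dead):
        """Remove a dead member so its former neighbours become adjacent.

        Returns the (left, right) pair that now has to link up directly.
        """
        left, right = self.neighbours(dead)
        self.members.remove(dead)
        return left, right


ring = Ring(["a", "b", "c", "d"])
print(ring.neighbours("b"))       # ('a', 'c')

relinked = ring.member_died("b")  # hypothetical stand-in for a 'DOWN' message
print(relinked)                   # ('a', 'c')
print(ring.neighbours("a"))       # ('d', 'c')
```

After "b" dies, "a" and "c" become direct neighbours and the ring 
closes up around the gap.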

In actual fact, mirror (i.e., HA) queues are implemented 'on top of' GM 
and also rely on Rabbit's clustering infrastructure, so additional 
(Erlang) process and node monitoring is in place at the level above GM 
which will also *notice* if a node goes down.

> Assuming that such a timeout exists, if there is a network split you may 
> end up with two clusters, each of which now has a master.  Each may 
> also have publishers and consumers that continue to work happily 
> against the split cluster.
>

Now we're talking about two different things. Rabbit clustering is a 
separate mechanism from mirror (HA) queues, though mirror queues rely on 
it. If a netsplit occurs then the surviving nodes which are still 
connected to the extant master *should* continue happily on. What will 
happen to the nodes in the other 'half' of the split, I'm not so sure, 
so I will put my hand up and ask someone better versed in this to fill 
in the blanks.
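
As a rough illustration of which nodes end up on the master's side of a 
split, here is a small Python sketch. The node names and the link list 
are made up, and it only works out who can still reach the master; it 
says nothing about how the other half behaves, which is the open 
question.

```python
from collections import deque


def reachable_from(start, links):
    """Return the set of nodes reachable from `start` over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in adjacency.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical four-node cluster; the split has cut every n1/n2 <-> n3/n4 link.
nodes = {"n1", "n2", "n3", "n4"}
master = "n1"
links_after_split = [("n1", "n2"), ("n3", "n4")]

master_side = reachable_from(master, links_after_split)
other_side = nodes - master_side

print(sorted(master_side))  # ['n1', 'n2']
print(sorted(other_side))   # ['n3', 'n4']
```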

> What happens when the network split is repaired?  Will the clusters 
> join?  If so, what will happen to the HA queue?  Will one of the 
> existing masters be demoted to slave?  If so, what happens to its queue 
> of messages that originated within its split cluster?  Are they lost?
>

AFAIK it is possible for Mnesia to heal itself after a netsplit, so 
getting nodes to rejoin a cluster might work without intervention, 
possibly depending on what has happened independently on the two 
'halves' of the split in the meantime. What I would not expect to happen 
(though I could be wrong here!) is for two distinct GM rings to join up 
and become one, promoting a new master or demoting an existing one. 
Demoting an existing master is undefined (i.e., not implemented) AFAICT.

When a node rejoins a cluster, Mnesia needs to reconcile the differences, 
and I would expect Mnesia to fail when trying to rejoin the cluster if 
the (Erlang) process ID recorded for the master differed between the two 
nodes.
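
The kind of disagreement I have in mind can be sketched like this in 
Python. To be clear, this is an assumption-laden toy: the dictionaries 
stand in for whatever each side has recorded, the queue names and pid 
strings are invented, and it does not reflect Mnesia's real tables or 
its merge behaviour.

```python
def conflicting_masters(view_a, view_b):
    """Return queues whose recorded master differs between two node views.

    Each view maps a queue name to the master pid that side has recorded.
    Queues known to only one side are ignored here.
    """
    return {
        queue: (view_a[queue], view_b[queue])
        for queue in view_a.keys() & view_b.keys()
        if view_a[queue] != view_b[queue]
    }


# Hypothetical records held by each 'half' once the split is repaired.
view_a = {"orders": "<0.241.0>@n1", "audit": "<0.305.0>@n1"}
view_b = {"orders": "<0.198.0>@n3", "audit": "<0.305.0>@n1"}

conflicts = conflicting_masters(view_a, view_b)
print(conflicts)  # {'orders': ('<0.241.0>@n1', '<0.198.0>@n3')}
```

Any queue that shows up in the result has two candidate masters, and 
something would have to decide which record wins.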

> I suppose a lot of this depends on the underlying Mnesia DB. 
> I realize RMQ is a CA system out of the CAP theorem, but it's not at all 
> clear what occurs in the face of a network partition.
>

Yes indeed - Mnesia does not play nicely in this kind of scenario. There 
are some efforts underway to make it *easier* to deal with netsplits 
(for example 
https://github.com/uwiger/otp/commit/3f70f3def4e33828da4237b07cbee9f73121c661 
and https://github.com/uwiger/unsplit) but these are not mainstream or 
ready for production use just yet.

And even if some mechanism were available, we would have the dual 
problems of deciding which Mnesia record is the correct one (the system 
of record) *and* being able to join two GM rings back together, which 
sounds infeasibly hard to me.

[monitors]
http://www.erlang.org/doc/reference_manual/processes.html#id82613
http://www.erlang.org/doc/man/erlang.html#monitor-2
http://www.erlang.org/doc/man/net_kernel.html
