[rabbitmq-discuss] autoheal behavior in the presence of HA-queues mastered on multiple nodes?

Mon Feb 24 10:57:54 GMT 2014

On 21/02/2014 11:49PM, Matt Pietrek wrote:
> Picture a two node cluster with nodes A and B, using HA-queues and
> autoheal. Some queues are mastered on A, and others on B. There's a VIP
> in front of the cluster that points to one of the brokers.

OK. Looking at how things evolve step by step:

> Now imagine a network event occurs and the cluster splits.

At this point the network is partitioned: both sides of the cluster are 
running separately, each believing that the other has gone down. So on 
A, all the queues that had masters on B fail over to A, and vice versa on B.

So this is the nub of why network partitions are a big deal; the two 
sides of the cluster both think they are authoritative and both start to 
evolve separately.

> Autoheal kicks in,

Of course autoheal will not kick in until the underlying network 
partition is resolved; until the two sides can see each other they will 
not have a clue anything has gone wrong (well, more wrong than node 
failure).

> selects a loser (say, B), and restarts it to rejoin the
> cluster. What happens to the data in queues that were mastered on B when
> B restarts?

Hopefully at this point you have your answer: since A is the winning 
side, the queue state from A overwrites the queue state from B. And the 
queue state from A will be whatever the state was at the time of the 
split (even for queues mastered on B), with whatever changes A has seen 
since.

Now, since you said earlier that there is a VIP pointing at one of the 
brokers, if that broker is A then maybe no changes have been taking 
place on B anyway. But if there were any changes on B, you lost them.
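
To make the sequence above concrete, here is a toy sketch in Python. It is 
not RabbitMQ code and does not use any RabbitMQ API: the queue names, the 
message names and the partition() and autoheal() helpers are all made up 
for illustration, and "autoheal" is reduced to "the losing side throws away 
its state and takes a copy of the winner's".

```python
# Toy model only: queue contents are plain lists, and nothing here talks
# to a real broker. All names are hypothetical.

def partition(shared_state):
    """Split one cluster view into two independent copies (sides A and B)."""
    side_a = {queue: list(msgs) for queue, msgs in shared_state.items()}
    side_b = {queue: list(msgs) for queue, msgs in shared_state.items()}
    return side_a, side_b


def autoheal(winner, loser):
    """Model restarting the loser: its state is replaced by the winner's.

    Returns the messages that existed only on the losing side.
    """
    lost = {
        queue: [m for m in msgs if m not in winner.get(queue, [])]
        for queue, msgs in loser.items()
    }
    loser.clear()
    loser.update({queue: list(msgs) for queue, msgs in winner.items()})
    # Report only the queues that actually lost something.
    return {queue: msgs for queue, msgs in lost.items() if msgs}


# Before the split: q1 is mastered on A, q2 on B, both mirrored.
cluster = {"q1": ["m1"], "q2": ["m2"]}
a, b = partition(cluster)

# During the split each side believes it is authoritative and accepts
# publishes on its own, so the two copies of q2 diverge.
a["q2"].append("published-via-A")
b["q2"].append("published-via-B")

# The network comes back, A wins, B is restarted.
lost = autoheal(winner=a, loser=b)
print(lost)
```

In this model the message published via B during the split is the only 
thing reported as lost, even though q2 was originally mastered on B; 
everything A saw survives.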

Partitions are a big deal, and autoheal is not a panacea.

> And does it matter if the messages in the queue were
> persistent or not?

No.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal