[rabbitmq-discuss] Mirror queue recovery questions

Elias Levy fearsome.lucidity at gmail.com
Fri Sep 9 18:03:45 BST 2011


>
> From: Matthew Sackman <matthew at rabbitmq.com>
>
> Yeah, this is a bit of a problem. Essentially, yes, there is a timeout,
> after which Rabbit will just give up trying to start. There are
> definitely ways in which we could imbue rabbitmqctl with the means to
> tell rabbit to abandon all hope of some remote node ever rejoining (and
> indeed, even ways to do this without the local rabbit coming up, which
> is essential in this case), but we've not yet written this. There is a
> bug open for this.
>
> After that, the various local mnesia databases will try to merge
> themselves back together, but if two nodes disagree about various
> details and they can't delegate to a common node that was alive at the
> point they failed, then mnesia will give up with an "unable to merge
> schema" error, and there'll really be no hope of getting them both to
> merge back together without resetting one of those nodes.
>
> Erm possibly. Much of this ordering stuff comes from mnesia rather than
> Rabbit. If the node's mnesia is happy to start up then Rabbit will then
> restore queues without further delay. That may include the promotion of
> slaves to master.
>

Matthew,

Thanks for your help understanding the failure modes.

To summarize, if I understand correctly: in the event of a cluster failure,
if the master of a mirrored queue fails to recover, then after some timeout
the remaining slaves may or may not choose a new master and recover. Whether
they do so depends on whether the underlying Mnesia tables converge. If they
do not converge, you may well have to reset one or more nodes, thus
discarding any persisted messages on them. In that case, you had better
choose wisely, and reset the nodes most out of sync with the lost master.
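
To make the "choose wisely" part concrete, here is a rough sketch of how I
picture the decision. It is purely illustrative: the NodeState record, the
last_synced_seq field and pick_nodes_to_reset() are names I made up, and as
far as I know Rabbit does not expose any such per-node sequence number. The
idea is only to keep the node(s) closest to the lost master and reset the
rest.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    # Hypothetical measure of how far this node got before the failure,
    # e.g. the highest message sequence it saw from the lost master.
    # This is an assumption for illustration, not something Rabbit reports.
    last_synced_seq: int


def pick_nodes_to_reset(nodes, keep=1):
    """Return the names of the nodes to reset, keeping the `keep` nodes
    that are most up to date with respect to the lost master."""
    ranked = sorted(nodes, key=lambda n: n.last_synced_seq, reverse=True)
    return [n.name for n in ranked[keep:]]


# Made-up example values, not real measurements.
nodes = [
    NodeState("rabbit@a", 980),
    NodeState("rabbit@b", 1000),
    NodeState("rabbit@c", 750),
]
to_reset = pick_nodes_to_reset(nodes)
```

With these invented numbers, rabbit@b would be kept and the other two would
be reset, losing whatever they had persisted.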

There is also currently no way to tell the cluster that a node will not be
rejoining, and thus to avoid waiting for the timeout, but there is an open
ticket for this.

BTW, how long is this timeout?  Is it configurable?

So would it be fair to say that mirrored queues are fault tolerant to node
loss, but not necessarily to cluster loss?

Elias Levy