[rabbitmq-discuss] Mirror queue recovery questions

Fri Sep 9 18:25:06 BST 2011

On Fri, Sep 09, 2011 at 10:03:45AM -0700, Elias Levy wrote:
> To summarize, if I understand correctly, in the event of a cluster failure,
> if the master of a mirrored queue fails to recover, after some timeout the
> remaining slaves may or may not choose a new master and recover.  Whether
> they do so will depend on whether the underlying Mnesia tables converge or
> not.  And if they do not converge, you may well have to reset one or more
> nodes, thus discarding any persisted messages in them.  In that case, you
> better choose wisely, and reset the nodes most out of sync with the lost
> master.

I think that summary is consistent with my understanding of mnesia and
the HA code.
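
Purely to illustrate the "choose wisely" point, and not as anything the
broker does for you: a toy Python sketch of picking which nodes to reset.
The node names and the per-node sequence numbers are hypothetical
estimates an operator would have to supply; RabbitMQ does not expose such
a figure through this function or any API implied here.

```python
def nodes_to_reset(last_seen, count):
    """Return the `count` node names that are furthest behind.

    last_seen maps a node name to the highest message sequence number that
    node is believed to hold (a hypothetical, operator-supplied estimate).
    """
    if count < 0 or count > len(last_seen):
        raise ValueError("count must be between 0 and the number of nodes")
    # Sort ascending by sequence number so the most stale nodes come first;
    # ties are broken by node name to keep the result deterministic.
    ranked = sorted(last_seen, key=lambda node: (last_seen[node], node))
    return ranked[:count]


if __name__ == "__main__":
    # Hypothetical estimates for three surviving slaves.
    estimate = {"rabbit@a": 1200, "rabbit@b": 950, "rabbit@c": 1180}
    print(nodes_to_reset(estimate, 1))
```

The idea is only that resetting the most stale nodes discards the fewest
persisted messages that the others do not also hold.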

> There is also currently no way to tell the cluster a node will not be
> rejoining, and thus avoiding waiting for the timeout, but there is an open
> ticket for this.

Correct.

> BTW, how long is this timeout?  Is it configurable?

30 seconds, and no. Well, not without recompiling the broker.

> So would it be fair to say that mirrored queues are fault tolerant to node
> loss, but not necessarily to cluster loss?

That's fair. Also, I've recently been reminded that we don't really
claim to support network partitions. This is mainly because mnesia
itself doesn't really support network partitions, hence the issues with
the tables needing to converge and having to reset a node if they do
not. This is consistent with choosing C and A from the CAP triangle.

Matthew