[rabbitmq-discuss] Mnesia corrupting after node joining cluster

Tue Apr 24 18:30:24 BST 2012

Hi Eli,

> That left the new node thinking it was not a member of the
> cluster, but the existing cluster still thought it was a RAM node.

Uhm, if you mean that the cluster still thought that the reset node 
was still part of the cluster, then that is weird and shouldn't happen. 
Can you reproduce this?

If you ran force_reset on the node, then that behavior is to be expected.

> We figured we could try taking down each node in the existing cluster
> one at a time, resetting it, and having it rejoin the cluster, hoping
> that it would clear whatever issue it had.  So long as we kept at least
> one disk node in the existing cluster our state should be maintained
> (minus any non-HA queues and messages in them).  Apparently we chose
> the wrong node.  When we tried to have it rejoin the remaining node in
> the existing cluster it failed.  It seems Mnesia on the remaining node
> was corrupted somehow.

If the cluster doesn't know that the other node left (which seems to be 
the case here), then it will try to sync its tables with the node it 
thinks is still in the cluster. However, since the other node has been 
reset, its tables will have different cookies and Mnesia will blow up. 
Posting the specific error would help in confirming that this was the 
problem.
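
To make the failure mode concrete, here is a toy Python sketch. It is 
not Mnesia's or RabbitMQ's actual code: the Node class, its fields and 
the sync function are all made up for illustration, and the only thing 
it models is the cookie comparison described above.

```python
import uuid


class Node:
    """Toy stand-in for a cluster node; not Mnesia's real data model."""

    def __init__(self, name):
        self.name = name
        self.known_peers = set()   # nodes this node believes are clustered with it
        self.table_cookies = {}    # table name -> cookie (hypothetical layout)

    def create_table(self, table):
        # Every freshly created table gets a new random cookie.
        self.table_cookies[table] = uuid.uuid4().hex

    def join(self, other):
        # Joining copies the peer's cookies, so both sides agree on them.
        self.table_cookies = dict(other.table_cookies)
        self.known_peers.add(other.name)
        other.known_peers.add(self.name)

    def force_reset(self):
        # Wipe local state and recreate the tables with new cookies.
        # Nothing tells the peers, so they keep listing this node.
        tables = list(self.table_cookies)
        self.known_peers.clear()
        self.table_cookies = {}
        for table in tables:
            self.create_table(table)


def sync(a, b):
    """Raise if a table held by both nodes carries different cookies."""
    for table, cookie in a.table_cookies.items():
        if table in b.table_cookies and b.table_cookies[table] != cookie:
            raise RuntimeError(
                "cookie mismatch on %s between %s and %s" % (table, a.name, b.name)
            )
    return True


if __name__ == "__main__":
    disk = Node("rabbit@disk")
    disk.create_table("rabbit_queue")
    ram = Node("rabbit@ram")
    ram.join(disk)
    sync(disk, ram)        # cookies agree, nothing happens
    ram.force_reset()      # disk still lists rabbit@ram as a peer
    try:
        sync(disk, ram)
    except RuntimeError as err:
        print(err)
```

The point of the sketch is that the remaining node's own data is 
untouched; the sync fails only because the stale peer now carries 
tables with cookies it does not recognise.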

> Can we please have a command that allows you to remove a node from a
> cluster from a node other than the node you are trying to remove? Bringing
> back up a node just to remove it from the cluster is time consuming and
> potentially error prone.

This is definitely a feature we're planning to implement. In general we 
want to make clustering more user friendly, and part of the work will be 
more clustering rabbitmqctl commands ("join_cluster", "depart_cluster", 
etc.).

> Can we please have some tools that will analyze Mnesia on each node and
> give us an idea of its health?  Whether it's corrupted or somehow out of
> sync with other nodes in the cluster.

If I understood correctly, in this case the problem wasn't corrupted 
Mnesia tables, but the cluster not being aware of which nodes were part 
of the cluster. What's definitely needed is better error reporting when 
this kind of thing happens, since right now the Mnesia errors are 
particularly ugly. Again, we're working on that.

Francesco.