[rabbitmq-discuss] Mnesia corrupting after node joining cluster
dbrown at prmllc.com
Wed Aug 1 18:20:44 BST 2012
Has there been any work on this issue (i.e. errors when doing admin work on
a cluster)? I've got a tiny, two-node development cluster. Removing the
ram node caused the remaining (disc) node to fail on startup with a
Mnesia-related error when I restarted it. Eventually, the remaining (disc)
node started up, but it still thinks the other node is clustered with it.
I've tried everything to get this node to realize it is the only node left
in the cluster, but nothing works. FWIW, the removed node does realize it
is no longer part of the two-node cluster.
I was very careful to follow the exact steps in the 'Breaking up a cluster'
section on the rabbitmq web site. At this point, I'm a bit concerned about
basing our production systems around rabbitmq (we're a small hedge fund)
when it seems to fail on the simplest of tasks. The only thing I can think
of to solve this problem would be to reinstall rabbitmq on all nodes in the
cluster (similar to what Eli had to do), which is unacceptable from a
production point of view.
Any help or insight would be greatly appreciated.
Thanks in advance
----- Original Message -----
From: "Francesco Mazzoli" <francesco at rabbitmq.com>
To: <rabbitmq-discuss at lists.rabbitmq.com>
Sent: Tuesday, April 24, 2012 12:30 PM
Subject: Re: [rabbitmq-discuss] Mnesia corrupting after node joining cluster
> Hi Eli,
>> That left the new node thinking it was not a member of the
>> cluster, but the existing cluster still thought it was a RAM node.
> Uhm, if you mean that the cluster still thought that the reset node was
> part of the cluster, then that is weird and shouldn't happen. Can
> you reproduce this?
> If you ran force_reset on the node, then that behavior is to be expected.
>> We figured we could try taking down each node in the existing cluster
>> one at a time, resetting it, and having it rejoin the cluster, hoping
>> that it would clear whatever issue it had. So long as we kept at least
>> one disk node in the existing cluster our state should be maintained
>> (minus any non-HA queues and messages in them). Apparently we chose
>> the wrong node. When we tried to have it rejoin the remaining node in
>> the existing cluster it failed. It seems Mnesia on the remaining node
>> was corrupted somehow.
> If the cluster doesn't know that the other node left (which seems to be
> the case here) then it will try to sync its tables with the node it thinks
> is still in the cluster. However, since the other node has been reset, its
> tables will have different cookies and Mnesia will blow up. Posting the
> specific error would help in confirming that this was the problem.
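
The failure mode described above can be illustrated with a small toy model. This is only a sketch, not Mnesia's actual implementation: the `Node` class, the `mismatched_tables` helper, and the table names are all hypothetical, and the "cookie" here is just a random identifier assigned when a table is created.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Node:
    name: str
    # table name -> cookie; a toy stand-in for a per-table cookie
    tables: dict = field(default_factory=dict)
    # names of the nodes this node believes are clustered with it
    peers: set = field(default_factory=set)

    def create_tables(self, names):
        """Create tables, giving each one a fresh random cookie."""
        self.tables = {n: uuid.uuid4().hex for n in names}

    def reset(self):
        """Forget the cluster and recreate the same tables with new cookies."""
        self.peers = set()
        self.create_tables(list(self.tables))


def mismatched_tables(a, b):
    """Return the sorted names of shared tables whose cookies differ."""
    shared = a.tables.keys() & b.tables.keys()
    return sorted(t for t in shared if a.tables[t] != b.tables[t])


# Two clustered nodes start out sharing the same table cookies.
disc = Node("rabbit@disc")
disc.create_tables(["queues", "exchanges"])  # illustrative table names
ram = Node("rabbit@ram", tables=dict(disc.tables))
disc.peers.add(ram.name)
ram.peers.add(disc.name)

# The ram node is reset, but the disc node is never told about it.
ram.reset()
stale_peer = ram.name in disc.peers
conflicts = mismatched_tables(disc, ram)
```

In this model the disc node still lists the reset node as a peer, so a sync attempt would compare tables whose cookies no longer match, which corresponds to the point where the real system fails.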
>> Can we please have a command that allows you to remove a node from a
>> cluster from a node other than the node you are trying to remove? Bringing
>> a node back up just to remove it from the cluster is time-consuming and
>> potentially error-prone.
> This is definitely a feature we're planning to implement. In general we
> want to make clustering more user-friendly, and part of the work will be
> more clustering rabbitmqctl commands ("join_cluster", "depart_cluster",
>> Can we please have some tools that will analyze Mnesia on each node and
>> give us an idea of its health, such as whether it's corrupted or somehow
>> out of sync with other nodes in the cluster?
> If I understood correctly, in this case the problem wasn't corrupted Mnesia
> tables, but the cluster not being aware of which nodes were part of the
> cluster. What's definitely needed is better error reporting when this kind
> of thing happens, since right now the Mnesia errors are particularly ugly.
> Again, we're working on that.
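
The kind of disagreement discussed in this thread, where one node still lists a peer that no longer considers itself clustered, can be detected by comparing each node's view of the membership. The following is a minimal sketch under stated assumptions: `stale_memberships` is a hypothetical helper, and how the views are collected (for example by reading `rabbitmqctl cluster_status` on each node) is left out.

```python
def stale_memberships(views):
    """Find one-sided cluster memberships.

    `views` maps a node name to the set of nodes that node believes are
    in its cluster (including itself). Returns sorted (observer, other)
    pairs where `observer` lists `other` but `other` does not list
    `observer` back. A node with no recorded view is treated as listing
    nobody, so it is flagged as well.
    """
    stale = []
    for observer, members in views.items():
        for other in members:
            if other == observer:
                continue
            if observer not in views.get(other, set()):
                stale.append((observer, other))
    return sorted(stale)


# The situation from this thread: the disc node still lists the ram
# node, while the ram node only knows about itself.
views = {
    "rabbit@disc": {"rabbit@disc", "rabbit@ram"},
    "rabbit@ram": {"rabbit@ram"},
}
report = stale_memberships(views)
```

Here `report` contains the single pair `("rabbit@disc", "rabbit@ram")`, meaning the disc node holds a membership record that the ram node does not share.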