Last night we had our first major production RabbitMQ incident.  We were
standing up a new cluster that was going to receive messages from the first
cluster through a shovel.  Sysops had not shut down the Erlang ports between
the new and existing cluster, and the base image used the same Erlang
cookie.  Inadvertently one of the nodes of the new cluster appears to have
had an old config pointing to the old cluster.  That node joined the
existing cluster as a RAM node.  The sysops tried a few things to remove
the node from the existing cluster without success, then I was called in.

By the time I came in the new node had been reconfigured to join the new
cluster, but the old cluster still believed it to be a member.  Since
RabbitMQ has no mechanism to remove a node from a cluster except from the
node itself, I had them take the new node and join it once more to the
existing cluster, then remove it using the reset command on the new node.
That left the new node thinking it was not a member of the cluster, but
the existing cluster still thought it was a RAM node.  We tried this a few
times unsuccessfully.

We figured we could try taking down each node in the existing cluster one
at a time, resetting it, and having it rejoin the cluster, hoping that it
would clear whatever issue it had.  So long as we kept at least one disk
node in the existing cluster our state should be maintained (minus any
non-HA queues and messages in them).  Apparently we chose the wrong node.
When we tried to have it rejoin the remaining node in the existing cluster
it failed.  It seems Mnesia on the remaining node was corrupted somehow.

In the end we had to remove the RabbitMQ lib directory and rebuild the
existing cluster from scratch.  As you can imagine that was not much fun.

This is not the first time I've seen Mnesia become confused somehow.

A few suggestions:

Can we please have a command that allows you to remove a node from a
cluster from a node other than the one you are trying to remove?  Bringing
a node back up just to remove it from the cluster is time consuming and
potentially error prone.

Can we please have some tools that will analyze Mnesia on each node and
give us an idea of its health, such as whether it is corrupted or somehow
out of sync with other nodes in the cluster?
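
To make the "out of sync" part of that request concrete, here is a rough
sketch in Python of the kind of check I have in mind.  It is only an
illustration, not an existing RabbitMQ tool.  It assumes you have already
collected each node's own view of cluster membership (for example by
running rabbitmqctl cluster_status on every node and parsing the output
yourself); the function name and the node names are hypothetical.

```python
from typing import Dict, Set


def find_stale_members(views: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Find nodes that others list as members but that do not agree.

    `views` maps each node name to the set of nodes it believes are in
    its cluster.  A member is reported as stale when some node lists it,
    but we either have no view from that member or its view does not
    include the node that listed it.

    Returns a mapping from each stale member to the set of nodes that
    still list it.
    """
    stale: Dict[str, Set[str]] = {}
    for reporter, members in views.items():
        for member in members:
            if member == reporter:
                continue
            peer_view = views.get(member)
            # One-sided membership: reporter lists member, but member
            # does not list reporter (or could not be queried at all).
            if peer_view is None or reporter not in peer_view:
                stale.setdefault(member, set()).add(reporter)
    return stale


# Hypothetical views resembling our incident: the old cluster still
# lists a node that now considers itself part of the new cluster.
example_views = {
    "rabbit@old1": {"rabbit@old1", "rabbit@old2", "rabbit@new1"},
    "rabbit@old2": {"rabbit@old1", "rabbit@old2", "rabbit@new1"},
    "rabbit@new1": {"rabbit@new1", "rabbit@new2"},
    "rabbit@new2": {"rabbit@new1", "rabbit@new2"},
}

if __name__ == "__main__":
    for member, reporters in sorted(find_stale_members(example_views).items()):
        print(member, "is still listed by", ", ".join(sorted(reporters)))
```

This only catches disagreement about membership.  It says nothing about
whether the Mnesia tables themselves are corrupted, which would still
need support from RabbitMQ itself.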

Elias Levy
