Last night we had our first major production RabbitMQ incident. We were standing up a new cluster that was going to receive messages from the first cluster through a shovel. Sysops had not shut down the Erlang ports between the new and existing clusters, and the base image used the same Erlang cookie. One of the nodes of the new cluster appears to have had an old config pointing to the old cluster, and it inadvertently joined the existing cluster as a RAM node. The sysops tried a few things to remove the node from the existing cluster without success, then I was called in.<div>
<br></div><div>By the time I came in, the new node had been reconfigured to join the new cluster, but the old cluster still believed it to be a member. Since RabbitMQ has no mechanism to remove a node from a cluster except from the node itself, I had them join the new node to the existing cluster once more, then remove it by running the reset command on the new node. That left the new node thinking it was not a member of the cluster, but the existing cluster still thought it was a RAM node. We tried this a few times unsuccessfully.</div>
<div><br></div><div>We figured we could try taking down each node in the existing cluster one at a time, resetting it, and having it rejoin the cluster, hoping that this would clear whatever issue it had. So long as we kept at least one disk node in the existing cluster, our state should be maintained (minus any non-HA queues and the messages in them). Apparently we chose the wrong node. When we tried to have it rejoin the remaining node in the existing cluster, it failed. It seems Mnesia on the remaining node was corrupted somehow.</div>
<div><br></div><div>In the end we had to remove the RabbitMQ lib directory and rebuild the existing cluster from scratch. As you can imagine, that was not much fun.</div><div><br></div><div>This is not the first time I've seen Mnesia become confused somehow.</div>
<div><br></div><div>A few suggestions:</div><div><br></div><div>Can we please have a command that allows you to remove a node from a cluster from a node other than the one you are trying to remove? Bringing a node back up just to remove it from the cluster is time-consuming and potentially error-prone.</div>
<div><br></div><div>Can we please have some tools that will analyze Mnesia on each node and give us an idea of its health, such as whether it's corrupted or somehow out of sync with other nodes in the cluster?</div><div><br></div>
<div><br></div><div>Elias Levy</div><div><br></div>