[rabbitmq-discuss] can't restart rabbit cluster after power outage

Fri Jun 20 17:45:05 BST 2014

On 20/06/14 16:55, Ben Hsu wrote:
> Hello, our rabbitmq cluster suffered a power outage, and we’re having
> trouble bringing it back up.
>
> our cluster has 2 disk backed nodes (node1 and node2) and 1 ram backed
> node (node3). I first tried to restart the disk backed nodes, and they
> both gave me an error, saying “timeout_waiting_for_tables” on the other
> two nodes. Googled around, and it sounded like the ram node was the last
> one to go out.

Nodes attempt to remember if they were the last to shut down, and try to 
enforce the idea that the last to shut down should be the first back up. 
However, only disc nodes count for that; RAM nodes don't.

If the whole thing suffered a power outage, it's possible that no node 
saw any other node die before it died itself, and so each disc node 
wants to wait for the other.

> So I tried restarting the ram backed node, and it started fine. But when
> I tried to start the disk backed node, it gave me a different error,
> basically saying “inconsistent_cluster, thinks its clustered with node3,
> but node3 disagrees”.

The RAM node should have refused to start alone. In older versions we
had a bug where it would start anyway and then get confused. Which
RabbitMQ version are you running?

> What I would love to do is take one of the disk nodes, start it as the
> master, and tell the other nodes to join its cluster. Is that possible?
> Right now I cannot even run “rabbitmqctl cluster_status” because the
> node won’t start

What you want is "rabbitmqctl forget_cluster_node --offline". This will:

1) Allow you to tell node1 or node2 that node3 has left the cluster 
(you'll need to re-add it later).

2) Reset node1 or node2's idea of which nodes were the last to shut 
down, allowing the cluster to start again.

"rabbitmqctl forget_cluster_node --offline" is currently a bit of a pain 
to use, since you have to start an Erlang node without booting RabbitMQ.

You can do this by adding "NODE_ONLY=true" to
/etc/rabbitmq/rabbitmq-env.conf on node1 or node2. Starting RabbitMQ in
whatever way is normal for you will then bring up an Erlang node
without RabbitMQ running (i.e. as if you'd successfully booted the
server and then invoked "rabbitmqctl stop_app").

You can now invoke "rabbitmqctl forget_cluster_node --offline node3".

Once you've done that, stop the node, remove NODE_ONLY=true from
rabbitmq-env.conf and start it again; it should now start correctly.
The other disc node should then be able to start up and join the
cluster without further fiddling.
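
The steps above could be scripted roughly as follows. This is only a
sketch under some assumptions that are not in the message itself: the
"service rabbitmq-server start" command stands in for "whatever's the
normal way for you", GNU sed is assumed for "sed -i", and the "run"
helper, the DRY_RUN switch and the "recover_disc_node" function name
are made up for illustration. The NODE_ONLY setting, the config path
and the forget_cluster_node invocation are as described above.

```shell
#!/bin/sh
# Sketch of the recovery sequence, to be used on ONE disc node only
# (node1 or node2). Needs root for the config file and the service.

ENV_CONF=/etc/rabbitmq/rabbitmq-env.conf   # path as given above
DRY_RUN=${DRY_RUN:-1}                      # 1 = only print the commands

# Hypothetical helper: print a command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

recover_disc_node() {
    forgotten=$1   # node to drop from the cluster, e.g. node3

    # 1. Make the next start bring up the Erlang node without RabbitMQ.
    run sh -c "echo 'NODE_ONLY=true' >> $ENV_CONF"
    run service rabbitmq-server start      # assumption: your usual start command

    # 2. Tell this node that the other node has left the cluster.
    run rabbitmqctl forget_cluster_node --offline "$forgotten"

    # 3. Stop the Erlang node, drop the NODE_ONLY line, start normally.
    run rabbitmqctl stop
    run sed -i '/^NODE_ONLY=true$/d' "$ENV_CONF"   # assumption: GNU sed
    run service rabbitmq-server start
}
```

As written, the block only defines the function, and calling
"recover_disc_node node3" just prints the commands it would run. Set
DRY_RUN=0 before loading the script to actually execute them, and check
each step by hand the first time rather than trusting the script.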

> meta question: is having a mix of disk and RAM based nodes in the same
> cluster a Bad Idea that needs to be fixed?

It didn't cause this problem. But overall, RAM nodes are quite a
special thing: they exist to speed up queue / exchange / binding
declarations in large clusters. Most clusters don't need them.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal