[rabbitmq-discuss] Repairing a crashed cluster

Wed Oct 10 16:53:19 BST 2012

On 10/10/12 14:49, Dave Seltzer wrote:
> My instinct, for the sake of uptime, is to say: "okay, forget node2,
> lets break the cluster and bring node1 online".
>
> My problem is that according to the docs I need to issue a "rabbitmqctl
> force_reset", which I can't do unless the server is running.
>
> I tried starting it using "rabbitmq-server -detached" but the server
> just exited after loading plugins.
>
> Does anyone know the right course of action in this scenario?

Hi. The bad news is that the released versions of RabbitMQ do not handle 
this situation well. The good news is that the next release will do better.

However, in your situation it is possible to break the cluster and bring 
node1 up. But it's a bit fiddly.

First of all, you will need to start node1 with the environment variable 
RABBITMQ_NODE_ONLY set to some value. This will start the Erlang VM 
without attempting to start RabbitMQ or Mnesia. Exactly how you do this 
depends on how you have RabbitMQ installed, but on Unix you would 
typically add that to /etc/rabbitmq/rabbitmq-env.conf. Note that our 
init scripts wait for RabbitMQ to start, so "/etc/init.d/rabbitmq-server 
start" will appear to hang, but the node itself will start.
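
As a rough illustration of the setting described above, the fragment 
below shows what the line might look like. It assumes that 
rabbitmq-env.conf is sourced as a shell script by the startup scripts, so 
a plain assignment is picked up. The value "true" is arbitrary.

```shell
# /etc/rabbitmq/rabbitmq-env.conf
# Temporary setting: start only the Erlang VM, not RabbitMQ or Mnesia.
# Assumption: this file is sourced by the RabbitMQ startup scripts.
RABBITMQ_NODE_ONLY=true
```

If you start the broker by hand rather than through the init script, 
setting the variable for a single invocation (for example 
"RABBITMQ_NODE_ONLY=true rabbitmq-server -detached") should have the same 
effect. Either way, the setting presumably has to be removed again before 
the final normal start, otherwise the node will come up without RabbitMQ.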

Once you have the node running, you can then invoke:

   rabbitmqctl eval 'mnesia:start(),[mnesia:force_load_table(T) || T <- rabbit_mnesia:table_names()],mnesia:del_table_copy(schema, rabbit@node2).'

(all on one line), with the real name of your second node substituted 
for node2. This should respond with:

   {atomic,ok}
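
If the other node's name contains characters such as hyphens or dots, 
the atom has to be single-quoted in Erlang, which is awkward to type 
inside a shell-quoted argument. The sketch below builds the expression 
for a given node name. forget_node_expr is a hypothetical helper name, 
not something shipped with RabbitMQ.

```shell
#!/bin/sh
# forget_node_expr: hypothetical helper, not part of RabbitMQ.
# Prints the Erlang expression used above, with the node name wrapped in
# single quotes so that names such as rabbit@mq-02 are valid atoms.
forget_node_expr() {
    # %s is replaced by the node name; \047 is the octal escape for '.
    printf 'mnesia:start(),[mnesia:force_load_table(T) || T <- rabbit_mnesia:table_names()],mnesia:del_table_copy(schema, \047%s\047).' "$1"
}

# Usage, on node1 while it is running in node-only mode:
#   rabbitmqctl eval "$(forget_node_expr rabbit@node2)"
```

The function only prints text. Nothing is sent to the node until the 
output is passed to "rabbitmqctl eval", so the expression can be 
inspected first.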

Then you can invoke

   rabbitmqctl stop

to stop node1 again. At this point it should have forgotten node2 and be 
able to start again normally.

Note that we don't use "rabbitmqctl force_reset" since that would reset 
node1, and the point is to make it forget node2.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, VMware