[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Sun May 20 08:08:42 BST 2012

Here is my setup:

   rabbit@qwe is at 10.0.0.1 (initially the master)
   rabbit@asd is at 10.0.0.2 (initially a slave)

asd has joined the cluster with qwe -- OK.

In my tests I need to stop and start one cluster node, qwe, which is
the master for my test queues. I use /usr/sbin/rabbitmqctl
{stop|start}_app for this, and that works fine.

To test slave promotion, I first stop the master (qwe), then after
some time start it again, so that it rejoins as a slave.

At the end of the test I stop asd and then start it again, so that qwe
takes mastership of the queues back.
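
The stop/start sequence above can be sketched as a small script. This
is only an illustration, not the exact procedure I ran: it assumes the
nodes are named rabbit@qwe and rabbit@asd, that rabbitmqctl is pointed
at each node with -n (which requires a shared Erlang cookie), and the
DRY_RUN and PAUSE variables and all function names are made up for the
example.

```shell
#!/bin/sh
# Sketch of the failover test sequence. By default it only prints the
# commands (DRY_RUN=1); set DRY_RUN=0 to actually run them.
# Assumptions: node names, pause length, and use of -n are illustrative.

RABBITMQCTL=${RABBITMQCTL:-/usr/sbin/rabbitmqctl}
DRY_RUN=${DRY_RUN:-1}
PAUSE=${PAUSE:-30}        # seconds to wait between steps (arbitrary)

# ctl NODE COMMAND: run (or print) one rabbitmqctl command against NODE
ctl() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$RABBITMQCTL -n $1 $2"
    else
        "$RABBITMQCTL" -n "$1" "$2"
    fi
}

# pause: wait (or print the wait) between steps
pause() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "sleep $PAUSE"
    else
        sleep "$PAUSE"
    fi
}

# bounce NODE: stop the rabbit application on NODE, wait, start it again
bounce() {
    ctl "$1" stop_app
    pause
    ctl "$1" start_app
}

# failover_test: one full round of the test
failover_test() {
    bounce rabbit@qwe     # asd should be promoted; qwe rejoins as slave
    pause
    bounce rabbit@asd     # qwe should take mastership back
    ctl rabbit@qwe cluster_status
}
```

Calling failover_test with the defaults prints the eight steps in
order without touching the cluster, which makes it easy to check the
sequence before running it for real.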

During the test, the cluster serves two clients: a message producer
and a message consumer, exchanging low-rate traffic through the slave
node (asd).

After a couple of test runs, start_app on asd fails (after a pause)
with:

Starting node 'rabbit@asd' ...
Error: {cannot_start_application,rabbit,
            {bad_return,
                {{rabbit,start,[normal,[]]},
                 {'EXIT',{rabbit,failure_during_boot}}}}}

cluster_status on qwe says:

Cluster status of node 'rabbit@qwe' ...
[{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
  {running_nodes,['rabbit@qwe']}]
...done.

And cluster_status on asd says:

Cluster status of node 'rabbit@asd' ...
[{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
...done.
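
For scripting the test, the two status outputs above can be checked
mechanically. The helper below is a rough sketch: is_running is a
made-up name, it assumes the output is shaped like the samples above
(node names in single quotes, written as rabbit@qwe and rabbit@asd,
inside a running_nodes tuple), and it relies on grep -o, which GNU and
BSD grep provide.

```shell
# is_running NODE: read `rabbitmqctl cluster_status` output on stdin and
# succeed only if NODE appears in the running_nodes list.
# Assumption: node names are printed in single quotes, as in the samples.
is_running() {
    # Join wrapped lines, isolate the running_nodes tuple, then look
    # for the quoted node name inside it.
    tr -d '\n' \
        | grep -o '{running_nodes,\[[^]]*]}' \
        | grep -qF "'$1'"
}
```

With the outputs shown above, the check succeeds for rabbit@qwe on qwe
and fails for rabbit@asd on both nodes, for example:
/usr/sbin/rabbitmqctl cluster_status | is_running rabbit@asd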

Now I want to remove asd from the cluster, but an attempt to run
stop_app/reset on asd gives (again after a pause):

Resetting node 'rabbit@asd' ...
Error: {timeout_waiting_for_tables,[gm_group]}

The only way out I have found is to throw the entire cluster away and
create a new one.

How can I recover from this situation?

Thanks,
Markiyan.