[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.
Markiyan Kushnir
markiyan.kushnir at gmail.com
Sun May 20 08:08:42 BST 2012
Here is my setup:
rabbit@qwe is at 10.0.0.1 (initially the master)
rabbit@asd is at 10.0.0.2 (initially a slave)
asd has joined the cluster with qwe -- OK.
In my tests I need to stop/start a cluster node -- qwe, which is a
master for my test queues. I use /usr/sbin/rabbitmqctl {stop|start}_app
for it -- everything is OK.
To test slave promotion, I first stop the master (qwe) and start it
again after some time, so that it rejoins as a slave.
At the end of the test I stop asd and then start it, so that qwe takes
queue mastership back.
During my test, the cluster serves two clients, a message producer and
a message consumer, which exchange messages at a low rate through the
slave node (asd).
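The stop/start cycle described above can be sketched as a small script. This is only an illustration, not the script actually used in the tests: it assumes the node names are rabbit@qwe and rabbit@asd (the archive renders "@" as " at "), that both nodes can be reached from one machine with rabbitmqctl's -n option, and the helper names and the 30-second pause are hypothetical.

```python
import subprocess
import time

RABBITMQCTL = "/usr/sbin/rabbitmqctl"  # path as given in the post


def ctl(node, action):
    """Build the rabbitmqctl command line that targets `node`."""
    return [RABBITMQCTL, "-n", node, action]


def failover_cycle(master="rabbit@qwe", slave="rabbit@asd"):
    """Return the four commands of one test cycle, in order."""
    return [
        ctl(master, "stop_app"),   # asd should be promoted to master
        ctl(master, "start_app"),  # qwe rejoins as a slave
        ctl(slave, "stop_app"),    # qwe should take mastership back
        ctl(slave, "start_app"),   # asd rejoins as a slave
    ]


def run(commands, pause=30.0, dry_run=True):
    """Print the commands, or execute them with a pause after each one."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if rabbitmqctl fails
            subprocess.run(cmd, check=True)
            time.sleep(pause)  # hypothetical settle time between steps
```

Calling run(failover_cycle()) only prints the four commands; passing dry_run=False would execute them against the cluster.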
Now after a couple of tests, when attempting to do start_app on asd, I
get (after some pause):
Starting node 'rabbit@asd' ...
Error: {cannot_start_application,rabbit,
{bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}}
cluster_status on qwe says:
Cluster status of node 'rabbit@qwe' ...
[{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
 {running_nodes,['rabbit@qwe']}]
...done.
And cluster_status on asd says:
Cluster status of node 'rabbit@asd' ...
[{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
...done.
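For scripted checks between test steps, the running_nodes entry of the two status outputs above could be extracted roughly as follows. This is a minimal sketch that assumes the output has exactly the shape shown in this post, with node names written with "@" (the archive renders it as " at "); the function name is hypothetical and the regular expressions are not a general Erlang term parser.

```python
import re


def running_nodes(status_text):
    """Return the node names listed under running_nodes in cluster_status output."""
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", status_text, re.S)
    if match is None:
        raise ValueError("no running_nodes entry found")
    # Node names are single-quoted atoms such as 'rabbit@qwe'
    return re.findall(r"'([^']+)'", match.group(1))


# The two outputs quoted in this post
QWE_STATUS = """[{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
 {running_nodes,['rabbit@qwe']}]"""
ASD_STATUS = "[{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]"

qwe_view = running_nodes(QWE_STATUS)  # only qwe is reported as running
asd_view = running_nodes(ASD_STATUS)  # asd reports no running nodes at all
```

An empty list from asd, or a list on qwe that lacks rabbit@asd, corresponds to the failed state described here.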
Now I want to remove asd from the cluster. Running stop_app and then
reset on asd gives (again after a pause):
Resetting node 'rabbit@asd' ...
Error: {timeout_waiting_for_tables,[gm_group]}
In this situation I can only throw the entire cluster away and create
a new one...
How can I recover from this situation?
Thanks,
Markiyan.