[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Francesco Mazzoli francesco at rabbitmq.com
Sun May 20 11:09:18 BST 2012


Hi Markiyan,

It's hard to tell what happened without any logs, but I'm going to bet 
that you tried to start asd while qwe was down. Since asd is a RAM node, 
and we don't like standalone RAM nodes, the boot sequence failed.
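
One way to check for this condition before running start_app on a RAM node is to look at the cluster_status output of the disc node. The Python sketch below is only an illustration: it assumes the output has the same Erlang-term shape as the cluster_status output quoted further down, and the helper names (names_under, disc_node_running) are hypothetical, not part of RabbitMQ.

```python
import re


def names_under(tag, cluster_status_output):
    """Return the quoted node names listed under {tag,[...]} in the text
    printed by `rabbitmqctl cluster_status` (format assumed from this thread)."""
    match = re.search(r"\{%s,\s*\[(.*?)\]\}" % re.escape(tag),
                      cluster_status_output, re.S)
    if match is None:
        return []
    return re.findall(r"'([^']+)'", match.group(1))


def disc_node_running(cluster_status_output):
    """True if at least one disc node also appears under running_nodes."""
    running = set(names_under("running_nodes", cluster_status_output))
    return any(node in running
               for node in names_under("disc", cluster_status_output))


# Sample text modelled on the cluster_status output quoted in this thread.
sample = """[{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
 {running_nodes,['rabbit@qwe']}]"""

if disc_node_running(sample):
    print("a disc node is running; start_app on the RAM node can reach it")
else:
    print("no disc node is running; start one before the RAM node")
```

If no disc node shows up under running_nodes, starting the disc node first (start_app on qwe) and only then the RAM node avoids the standalone RAM node case.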

If that's not the problem, please provide more precise instructions on 
how to reproduce it (if you can reproduce it).

Francesco.

On 20/05/12 08:08, Markiyan Kushnir wrote:
> Here is my setup:
>
> rabbit@qwe is at 10.0.0.1 (initially the master)
> rabbit@asd is at 10.0.0.2 (initially a slave)
>
> asd has joined the cluster with qwe -- OK.
>
> In my tests I need to stop/start a cluster node -- qwe, which is a
> master for my test queues. I use /usr/sbin/rabbitmqctl {stop|start}_app
> for it -- everything is OK.
>
> In order to test slave promotion, I first stop the master (qwe), then
> after some time I start it, so that it now becomes a slave.
>
> At the end of the test I stop asd, then start it, so that qwe takes
> queue mastership back.
>
> During my test, the cluster serves two clients: a message producer and
> a message consumer, exchanging messages at a low rate through the
> slave node (asd).
>
>
> Now, after a couple of test runs, when attempting to do start_app on
> asd, I get (after some pause):
>
> Starting node 'rabbit@asd' ...
> Error: {cannot_start_application,rabbit,
> {bad_return,
> {{rabbit,start,[normal,[]]},
> {'EXIT',{rabbit,failure_during_boot}}}}}
>
>
>
> cluster_status on qwe says:
>
> Cluster status of node 'rabbit@qwe' ...
> [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
> {running_nodes,['rabbit@qwe']}]
> ...done.
>
>
> And cluster_status on asd says:
>
> Cluster status of node 'rabbit@asd' ...
> [{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
> ...done.
>
> Now I want to remove asd from the cluster... An attempt to run
> stop_app/reset on asd gives (after some pause as well):
>
> Resetting node 'rabbit@asd' ...
> Error: {timeout_waiting_for_tables,[gm_group]}
>
>
> In this situation I can only throw the entire cluster away and create
> a new one...
>
> How can I recover from this situation?
>
>
> Thanks,
> Markiyan.