[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Francesco Mazzoli francesco at rabbitmq.com
Sun May 20 11:11:56 BST 2012


Oh, I forgot to give a solution to your problem (if you are actually 
trying to start a standalone RAM node): start the disc node first and 
then the RAM node.
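
As a rough illustration of that ordering, here is a minimal Python sketch that sorts a cluster's nodes so that disc nodes come before RAM nodes, then builds the matching `rabbitmqctl -n <node> start_app` command lines. The node names and types come from the cluster_status output quoted below. The helper names `start_order` and `start_commands` are made up for this example. The script only prints the commands rather than running them, and it assumes the Erlang VM for each node is already up, as with stop_app/start_app.

```python
# Hypothetical helpers: order nodes so disc nodes boot before RAM nodes.
# Node names/types mirror the cluster_status output quoted in this thread.

def start_order(nodes):
    """Return node names with all disc nodes first, then RAM nodes.

    `nodes` maps node name -> "disc" or "ram". Python's sort is stable,
    so nodes of the same type keep their original relative order.
    """
    for name, kind in nodes.items():
        if kind not in ("disc", "ram"):
            raise ValueError(f"unknown node type for {name}: {kind!r}")
    # False sorts before True, so disc nodes end up at the front.
    return sorted(nodes, key=lambda name: nodes[name] != "disc")


def start_commands(nodes):
    """Build (but do not run) one `rabbitmqctl start_app` command per node."""
    return [["rabbitmqctl", "-n", name, "start_app"]
            for name in start_order(nodes)]


if __name__ == "__main__":
    cluster = {"rabbit@asd": "ram", "rabbit@qwe": "disc"}
    for cmd in start_commands(cluster):
        print(" ".join(cmd))
```

With the two nodes from this thread, the command for rabbit@qwe (disc) is listed before the one for rabbit@asd (RAM).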

Francesco.

On 20/05/12 11:09, Francesco Mazzoli wrote:
> Hi Markiyan,
>
> It's hard to tell what happened without any logs, but I'm going to bet
> that you tried to start asd while qwe was down. Since asd is a RAM node,
> and we refuse to start a RAM node on its own (with no disc node
> running), the boot sequence failed.
>
> If that's not the problem, please provide more precise instructions on
> how to reproduce it (if you can reproduce it).
>
> Francesco.
>
> On 20/05/12 08:08, Markiyan Kushnir wrote:
>> Here is my setup:
>>
>> rabbit@qwe is at 10.0.0.1 (initially the master)
>> rabbit@asd is at 10.0.0.2 (initially a slave)
>>
>> asd has joined the cluster with qwe -- OK.
>>
>> In my tests I need to stop/start a cluster node -- qwe, which is a
>> master for my test queues. I use /usr/sbin/rabbitmqctl {stop|start}_app
>> for it -- everything is OK.
>>
>> In order to test slave promotion, I first stop the master (qwe), then
>> after some time I start it, so that it now becomes a slave.
>>
>> At the end of the test I stop asd, then start it, so that qwe takes
>> queue mastership back.
>>
>> During my test, the cluster serves two clients: a message producer and
>> a message consumer, exchanging messages at a low rate through the
>> slave node (asd).
>>
>>
>> Now after a couple of tests, when attempting to do start_app on asd, I
>> get (after some pause):
>>
>> Starting node 'rabbit@asd' ...
>> Error: {cannot_start_application,rabbit,
>>            {bad_return,
>>                {{rabbit,start,[normal,[]]},
>>                 {'EXIT',{rabbit,failure_during_boot}}}}}
>>
>>
>>
>> cluster_status on qwe says:
>>
>> Cluster status of node 'rabbit@qwe' ...
>> [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
>>  {running_nodes,['rabbit@qwe']}]
>> ...done.
>>
>>
>> And cluster_status on asd says:
>>
>> Cluster status of node 'rabbit@asd' ...
>> [{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
>> ...done.
>>
>> Now I want to remove asd from the cluster... An attempt to run
>> stop_app/reset on asd gives (after some pause as well):
>>
>> Resetting node 'rabbit@asd' ...
>> Error: {timeout_waiting_for_tables,[gm_group]}
>>
>>
>> In this situation I can only throw the entire cluster away and create
>> a new one...
>>
>> How can I recover from this situation?
>>
>>
>> Thanks,
>> Markiyan.
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


