[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Markiyan Kushnir markiyan.kushnir at gmail.com
Mon May 21 10:20:36 BST 2012


2012/5/21 Markiyan Kushnir <markiyan.kushnir at gmail.com>:
> 2012/5/20 Francesco Mazzoli <francesco at rabbitmq.com>:
>> Oh, I forgot to give a solution to your problem (if you are actually trying
>> to start a standalone RAM node): start the disc node first and then the RAM
>> node.
>>
>> Francesco.
>>
>>
>> On 20/05/12 11:09, Francesco Mazzoli wrote:
>>>
>>> Hi Markiyan,
>>>
>>> It's hard to tell what happened without any logs, but I'm going to bet
>>> that you tried to start asd while qwe was down. Since asd is a RAM node,
>>> and we don't like standalone RAM nodes, the boot sequence failed.
>>>
>>> If that's not the problem, please provide more precise instructions on
>>> how to reproduce the problem (if you can reproduce it).
>>>
>>> Francesco.
>>>
>
> Hello Francesco,
>
> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
> minimal test case.
>
> Test.
>
> 1. Both qwe and asd are up and running, serving clients.
> 2. While qwe (the master) is in the middle of accepting messages, do
>    stop_app on qwe.
> 3. Wait until asd is promoted, then start_app on qwe.
> 4. Do stop_app on asd.
> 5. Wait, then do start_app on asd. qwe is now the master again.
> 6. Repeat 2, 3, 4, 5. See "ERROR REPORT" (rabbit@asd.log) on asd. The
>    command sequence for one iteration is sketched below.
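>
> For reference, one iteration of steps 2-5 boils down to the following
> commands (a condensed sketch; the waiting in steps 3 and 5 is done by
> watching the clients and cluster_status, and is not shown here):
>
>     /usr/sbin/rabbitmqctl stop_app     # on qwe: asd gets promoted
>     /usr/sbin/rabbitmqctl start_app    # on qwe: it rejoins as a slave
>     /usr/sbin/rabbitmqctl stop_app     # on asd
>     /usr/sbin/rabbitmqctl start_app    # on asd: qwe is the master again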
>
> Clients.
>
> Two of my clients are based on rabbitmqadmin.txt to publish
> "commands" and read "replies" (the infrastructure). Another one is
> Python-based, and both consumes "commands" and publishes "replies"
> (the target app). The target app handles the basic.cancel that arrives
> as a result of slave promotion and re-issues its basic.consume
> (support for consumer cancellation notifications). When the connection
> to a node is lost due to stop_app, the target app tries to re-connect
> to another node.
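>
> The relevant part of the target app looks roughly like this (a
> simplified sketch, not the real code: I'm using pika here purely for
> illustration, the queue name is made up, and I'm assuming the broker's
> basic.cancel surfaces as pika's ConsumerCancelled exception):
>
>     import pika
>
>     NODES = ['10.0.0.1', '10.0.0.2']    # qwe, asd
>     QUEUE = 'commands'                  # illustrative name
>
>     def handle_command(channel, method, properties, body):
>         # ... process the "command" and publish a "reply" ...
>         channel.basic_ack(delivery_tag=method.delivery_tag)
>
>     def run():
>         node = 0
>         while True:
>             try:
>                 conn = pika.BlockingConnection(
>                     pika.ConnectionParameters(host=NODES[node]))
>                 ch = conn.channel()
>                 while True:
>                     try:
>                         ch.basic_consume(queue=QUEUE,
>                                          on_message_callback=handle_command)
>                         ch.start_consuming()
>                     except pika.exceptions.ConsumerCancelled:
>                         # cancelled by the broker (slave promotion):
>                         # re-issue basic.consume on the same channel
>                         continue
>             except pika.exceptions.AMQPConnectionError:
>                 # the node went away (stop_app): try the other node
>                 node = (node + 1) % len(NODES)
>
>     run()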
>
>
>
> On your comment about standalone RAM nodes, to be precise: asd (which
> really is a RAM node) may be left running standalone for some time
> while qwe is going through its stop_app/start_app cycle. In my test,
> qwe and asd are never down at the same time.
>
> Please let me know if there is anything else that might help.
>

Forgot to mention that qwe was actually listening on port 57524 all
the time (this might be relevant to the issue).


Markiyan.


> Thanks,
> Markiyan.
>
>
>
>>> On 20/05/12 08:08, Markiyan Kushnir wrote:
>>>>
>>>> Here is my setup:
>>>>
>>>> rabbit@qwe is at 10.0.0.1 (initially the master)
>>>> rabbit@asd is at 10.0.0.2 (initially a slave)
>>>>
>>>> asd has joined the cluster with qwe -- OK.
>>>>
>>>> In my tests I need to stop/start a cluster node -- qwe, which is a
>>>> master for my test queues. I use /usr/sbin/rabbitmqctl {stop|start}_app
>>>> for it -- everything is OK.
>>>>
>>>> In order to test slave promotion, I first stop the master (qwe), then
>>>> after some time I start it, so that it now becomes a slave.
>>>>
>>>> At the end of the test I stop asd, then start it, so that qwe takes
>>>> queue mastership back.
>>>>
>>>> During my test, the cluster serves two clients: a message producer and
>>>> a message consumer, running some low rate communication through the
>>>> slave node (asd).
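>>>>
>>>> The producer side boils down to something like this (a rough sketch
>>>> with pika; the queue name is illustrative, and the x-ha-policy
>>>> argument is shown only as one way of declaring the queue mirrored):
>>>>
>>>>     import pika
>>>>
>>>>     # connect through the slave node (asd)
>>>>     conn = pika.BlockingConnection(
>>>>         pika.ConnectionParameters(host='10.0.0.2'))
>>>>     ch = conn.channel()
>>>>     # mirrored queue; in this setup its master lives on qwe
>>>>     ch.queue_declare(queue='commands', durable=True,
>>>>                      arguments={'x-ha-policy': 'all'})
>>>>     ch.basic_publish(exchange='', routing_key='commands', body='ping')
>>>>     conn.close()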
>>>>
>>>>
>>>> Now after a couple of tests, when attempting to do start_app on asd, I
>>>> get (after some pause):
>>>>
>>>> Starting node 'rabbit@asd' ...
>>>> Error: {cannot_start_application,rabbit,
>>>>            {bad_return,
>>>>                {{rabbit,start,[normal,[]]},
>>>>                 {'EXIT',{rabbit,failure_during_boot}}}}}
>>>>
>>>>
>>>>
>>>> cluster_status on qwe says:
>>>>
>>>> Cluster status of node 'rabbit@qwe' ...
>>>> [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
>>>>  {running_nodes,['rabbit@qwe']}]
>>>> ...done.
>>>>
>>>>
>>>> And cluster_status on asd says:
>>>>
>>>> Cluster status of node 'rabbit@asd' ...
>>>> [{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
>>>> ...done.
>>>>
>>>> Now I want to remove asd from the cluster... An attempt to run
>>>> stop_app/reset on asd gives (after some pause as well):
>>>>
>>>> Resetting node 'rabbit@asd' ...
>>>> Error: {timeout_waiting_for_tables,[gm_group]}
>>>>
>>>>
>>>> In this situation I can only throw the entire cluster away and create
>>>> a new one...
>>>>
>>>> How can I recover from this situation?
>>>>
>>>>
>>>> Thanks,
>>>> Markiyan.

