[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Mon May 21 10:13:07 BST 2012

2012/5/20 Francesco Mazzoli <francesco at rabbitmq.com>:
> Oh, I forgot to give a solution to your problem (if you are actually trying
> to start a standalone RAM node): start the disc node first and then the RAM
> node.
>
> Francesco.
>
>
> On 20/05/12 11:09, Francesco Mazzoli wrote:
>>
>> Hi Markiyan,
>>
>> It's hard to tell what happened without any logs, but I'm going to bet
>> that you tried to start asd while qwe was down. Since asd is a RAM node,
>> and we don't like standalone RAM nodes, the boot sequence failed.
>>
>> If that's not the problem, please provide more precise instruction on
>> how to reproduce the problem (if you can reproduce it).
>>
>> Francesco.
>>

Hello Francesco,

I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
minimal test case.

Test.

1. Both qwe and asd are up and running, serving clients.
2. While qwe (the master) is accepting messages, do stop_app on qwe.
3. Wait until asd is promoted, then start_app on qwe.
4. Do stop_app on asd.
5. Wait, then do start_app on asd. qwe is now the master again.
6. Repeat 2,3,4,5 again. See "ERROR REPORT" (rabbit@asd.log) on asd.
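
The steps above can be sketched as a small Python driver. This is only
an illustration, not the script used in the test: the function names
are hypothetical, and the fixed pause is an assumption standing in for
"wait until asd is promoted". The rabbitmqctl path is the one mentioned
later in this thread, and -n selects the node to act on.

```python
import subprocess
import time

RABBITMQCTL = "/usr/sbin/rabbitmqctl"  # path as given in the original report


def cycle_commands(master="rabbit@qwe", slave="rabbit@asd", rounds=2):
    """Build the rabbitmqctl invocations for steps 2-5, repeated `rounds` times."""
    cmds = []
    for _ in range(rounds):
        # Steps 2-3 act on the master, steps 4-5 on the slave.
        for node in (master, slave):
            cmds.append([RABBITMQCTL, "-n", node, "stop_app"])
            cmds.append([RABBITMQCTL, "-n", node, "start_app"])
    return cmds


def run_cycle(cmds, pause=30.0):
    """Run the commands in order, pausing after each stop_app.

    The fixed pause is an assumption; the real test waits for promotion.
    Failures are reported rather than raised, since a failing start_app
    is exactly what the test is trying to reproduce.
    """
    for cmd in cmds:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(" ".join(cmd), "-> exit code", result.returncode)
        if cmd[-1] == "stop_app":
            time.sleep(pause)
```

Calling run_cycle(cycle_commands()) on a host that can reach both nodes
would walk through the sequence twice, matching step 6.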

Clients.

Two of my clients are based on rabbitmqadmin.txt; they publish
"commands" and read "replies" (infrastructure). The third one is Python
based and both consumes "commands" and publishes "replies" (target
app). The target app handles the basic.cancel that arrives as a result
of slave promotion and re-issues its basic.consume (support for consumer
cancellation notifications). When the connection to a node is lost due
to stop_app, the target app tries to reconnect to another node.
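
The reconnect behaviour described above might look roughly like the
sketch below. It is not the target app's actual code: connect is a
hypothetical callable supplied by the client (for example a thin
wrapper around its AMQP library), and the attempt count and delay are
arbitrary assumptions. The two addresses are those of qwe and asd from
the setup quoted further down.

```python
import itertools
import time

NODES = ["10.0.0.1", "10.0.0.2"]  # qwe and asd, per the setup in this thread


def connect_with_failover(connect, nodes=NODES, attempts=6, delay=1.0,
                          sleep=time.sleep):
    """Try the nodes in round-robin order until `connect(host)` succeeds.

    `connect` is a hypothetical callable that returns a connection object
    or raises OSError (e.g. ConnectionRefusedError) when the node is down.
    Returns a (host, connection) pair.
    """
    last_error = None
    for host in itertools.islice(itertools.cycle(nodes), attempts):
        try:
            return host, connect(host)
        except OSError as exc:
            last_error = exc
            sleep(delay)  # back off briefly before trying the next node
    raise ConnectionError(
        "no node reachable after %d attempts" % attempts
    ) from last_error
```

After a successful reconnect the client would still need to re-issue
its basic.consume, as it does when it receives basic.cancel.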

Regarding your comment on standalone RAM nodes: to be precise, asd
(which really is a RAM node) may be left running standalone for some
time while qwe goes through a stop_app/start_app cycle. In my test, qwe
and asd are never down at the same time.

Please let me know if there is anything else that might help.

Thanks,
Markiyan.

>> On 20/05/12 08:08, Markiyan Kushnir wrote:
>>>
>>> Here is my setup:
>>>
>>> rabbit@qwe is at 10.0.0.1 (initially the master)
>>> rabbit@asd is at 10.0.0.2 (initially a slave)
>>>
>>> asd has joined the cluster with qwe -- OK.
>>>
>>> In my tests I need to stop/start a cluster node -- qwe, which is a
>>> master for my test queues. I use /usr/sbin/rabbitmqctl {stop|start}_app
>>> for it -- everything is OK.
>>>
>>> In order to test slave promotion, I first stop the master (qwe), then
>>> after some time I start it, so that it now becomes a slave.
>>>
>>> At the end of the test I stop asd, then start it, so that qwe takes
>>> queues mastership back over.
>>>
>>> During my test, the cluster serves two clients: a message producer and
>>> a message consumer, running some low rate communication through the
>>> slave node (asd).
>>>
>>>
>>> Now after a couple of tests, when attempting to do start_app on asd, I
>>> get (after some pause):
>>>
>>> Starting node 'rabbit@asd' ...
>>> Error: {cannot_start_application,rabbit,
>>>         {bad_return,
>>>          {{rabbit,start,[normal,[]]},
>>>           {'EXIT',{rabbit,failure_during_boot}}}}}
>>>
>>>
>>>
>>> cluster_status on qwe says:
>>>
>>> Cluster status of node 'rabbit@qwe' ...
>>> [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
>>>  {running_nodes,['rabbit@qwe']}]
>>> ...done.
>>>
>>>
>>> And cluster_status on asd says:
>>>
>>> Cluster status of node 'rabbit@asd' ...
>>> [{nodes,[{unknown,['rabbit@asd']}]},{running_nodes,[]}]
>>> ...done.
>>>
>>> Now I want to remove asd from the cluster... An attempt to run
>>> stop_app/reset on asd gives (after some pause as well):
>>>
>>> Resetting node 'rabbit@asd' ...
>>> Error: {timeout_waiting_for_tables,[gm_group]}
>>>
>>>
>>> In this situation I can only throw the entire cluster away and create
>>> a new one...
>>>
>>> How can I recover from this situation?
>>>
>>>
>>> Thanks,
>>> Markiyan.
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq-discuss at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: asd.tgz
Type: application/x-gzip
Size: 3471 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/7049b83c/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qwe.tgz
Type: application/x-gzip
Size: 2503 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/7049b83c/attachment-0001.bin>