[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.
Markiyan Kushnir
markiyan.kushnir at gmail.com
Mon May 21 14:41:27 BST 2012
2012/5/21 Markiyan Kushnir <markiyan.kushnir at gmail.com>:
> 2012/5/21 Francesco Mazzoli <francesco at rabbitmq.com>:
>>> Hello Francesco,
>>>
>>> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
>>> minimal test case.
>>>
>>> Test.
>>>
>>> 1. Both qwe and asd are up and running, serving clients.
>>> 2. While qwe (the master) is accepting messages, do stop_app on qwe.
>>> 3. Wait until asd is promoted, then start_app on qwe.
>>> 4. Do stop_app on asd.
>>> 5. Wait, then do start_app on asd. qwe is now the master again.
>>> 6. Repeat steps 2-5. See the "ERROR REPORT" in rabbit@asd.log on asd.
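>>>
>>> Spelled out as shell commands, one cycle of steps 2-5 looks roughly
>>> like this (a sketch, run on the respective hosts):
>>>
>>>     qwe$ sudo /usr/sbin/rabbitmqctl stop_app      # step 2
>>>     # ... wait until asd is promoted ...          # step 3
>>>     qwe$ sudo /usr/sbin/rabbitmqctl start_app
>>>     asd$ sudo /usr/sbin/rabbitmqctl stop_app      # step 4
>>>     # ... wait ...                                # step 5
>>>     asd$ sudo /usr/sbin/rabbitmqctl start_app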
>>
>>
>> I could not reproduce this on my local machine.
>>
>> I don't think it has anything to do with HA queues, as the log indicates
>> `asd' is having problems contacting `qwe' when started. I would expect
>> seeing a timeout error like yours if `asd' was a disc node and `qwe' was
>> down, but if `asd' is a RAM node I'd expect a different error. What version
>> of RabbitMQ are you running? It might be that on earlier versions we had
>> looser checks for standalone RAM nodes.
>>
>> In any case, I would never expect that to happen if at least one node in the
>> cluster is up at all times, which seems to be the case here. Can you
>> check that the error still shows up with those precise steps, making
>> sure that the connection between the two nodes is not severed - and
>> without bothering with HA queues or publishing/consuming?
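>>
>> For instance, a loop along these lines (a sketch; it assumes a shared
>> Erlang cookie so rabbitmqctl can address both nodes via -n from one
>> host) would exercise just the stop/start cycle with no clients
>> attached:
>>
>>     for i in $(seq 10); do
>>         rabbitmqctl -n rabbit@qwe stop_app  && sleep 5
>>         rabbitmqctl -n rabbit@qwe start_app && sleep 5
>>         rabbitmqctl -n rabbit@asd stop_app  && sleep 5
>>         rabbitmqctl -n rabbit@asd start_app && sleep 5
>>         rabbitmqctl -n rabbit@asd cluster_status || break
>>     done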
>>
>>
>>> On your comment on standalone RAM nodes, to be precise, asd (which is
>>> really a RAM node) may be left running standalone for some time while
>>> qwe is doing its stop_app/start_app cycle. In my test, qwe and asd
>>> are never down at the same time.
>>
>>
>> Having a standalone RAM node is a very bad idea, and we actively try to
>> prevent the user from creating that situation when we can. I would strongly
>> advise you against doing that, unless you have good reasons to (but I doubt
>> it).
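>>
>> (A standalone RAM node is easy to spot in cluster_status: the only
>> running node appears under {ram,...}, so no running node holds a disc
>> copy of the schema. A sketch of what that would look like:
>>
>>     asd$ sudo /usr/sbin/rabbitmqctl cluster_status
>>     Cluster status of node 'rabbit@asd' ...
>>     [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
>>      {running_nodes,['rabbit@asd']}]
>>     ...done.
>> )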
>>
>> Francesco.
>
> Re-run my test with two separate modifications:
>
> - without the clients' activity -- the issue didn't show up. After a
> few dozen test runs I couldn't hit it.
>
> - asd changed from RAM to disc (and the clients were busy with their
> communication as in the original test) -- asd went unresponsive after
> a couple of test runs. One difference is that both start_app on asd and
> stop_app on qwe freeze, so I have to CTRL-C them.
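>
> (For reference, the RAM-to-disc change was roughly the standard 2.x
> sequence -- a sketch; with the 2.x cluster command, listing the local
> node itself among the cluster nodes makes it a disc node:
>
>     asd$ sudo /usr/sbin/rabbitmqctl stop_app
>     asd$ sudo /usr/sbin/rabbitmqctl cluster rabbit@qwe rabbit@asd
>     asd$ sudo /usr/sbin/rabbitmqctl start_app
> )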
>
> After the tests (the start_app/stop_app commands hang), cluster_status
> still shows info on both nodes:
>
>
> 13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit@qwe' ...
> [{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
> {running_nodes,['rabbit@asd','rabbit@qwe']}]
> ...done.
>
> 13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit@asd' ...
> [{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
> {running_nodes,['rabbit@qwe','rabbit@asd']}]
> ...done.
>
>
> Here is the status output on both nodes:
>
> 13:18:~$ sudo /usr/sbin/rabbitmqctl status
> Status of node 'rabbit@qwe' ...
> [{pid,6018},
> {running_applications,
> [{rabbitmq_management,"RabbitMQ Management Console","2.8.2"},
> {rabbitmq_management_agent,"RabbitMQ Management Agent","2.8.2"},
> {rabbit,"RabbitMQ","2.8.2"},
> {mnesia,"MNESIA CXC 138 12","4.4.12"},
> {os_mon,"CPO CXC 138 46","2.2.4"},
> {xmerl,"XML parser","1.2.3"},
> {amqp_client,"RabbitMQ AMQP Client","2.8.2"},
> {sasl,"SASL CXC 138 11","2.1.8"},
> {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
> {webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
> {mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
> {inets,"INETS CXC 138 49","5.2"},
> {stdlib,"ERTS CXC 138 10","1.16.4"},
> {kernel,"ERTS CXC 138 10","2.13.4"}]},
> {os,{unix,linux}},
> {erlang_version,
> "Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
> [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,
> [{total,36949080},
> {processes,14130744},
> {processes_used,14117816},
> {system,22818336},
> {atom,1519633},
> {atom_used,1498588},
> {binary,274160},
> {code,18291044},
> {ets,1218344}]},
> {vm_memory_high_watermark,0.3999999994254929},
> {vm_memory_limit,417749401},
> {disk_free_limit,1044373504},
> {disk_free,76397264896},
> {file_descriptors,
> [{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
> {processes,[{limit,1048576},{used,227}]},
> {run_queue,0},
> {uptime,2334}]
> ...done.
>
> 13:27:~$ sudo /usr/sbin/rabbitmqctl status
> Status of node 'rabbit@asd' ...
> [{pid,9804},
> {running_applications,
> [{mnesia,"MNESIA CXC 138 12","4.4.12"},
> {os_mon,"CPO CXC 138 46","2.2.4"},
> {xmerl,"XML parser","1.2.3"},
> {amqp_client,"RabbitMQ AMQP Client","2.8.2"},
> {sasl,"SASL CXC 138 11","2.1.8"},
> {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
> {webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
> {mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
> {inets,"INETS CXC 138 49","5.2"},
> {stdlib,"ERTS CXC 138 10","1.16.4"},
> {kernel,"ERTS CXC 138 10","2.13.4"}]},
> {os,{unix,linux}},
> {erlang_version,
> "Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
> [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,
> [{total,33949376},
> {processes,11465800},
> {processes_used,11346720},
> {system,22483576},
> {atom,1519633},
> {atom_used,1498775},
> {binary,94768},
> {code,18291044},
> {ets,1137896}]},
> {file_descriptors,
> [{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
> {processes,[{limit,1048576},{used,105}]},
> {run_queue,0},
> {uptime,2483}]
>
>
>
> Note that in the status output from asd above, `rabbit' itself is
> missing from running_applications, so the broker application is not
> actually running there even though the Erlang node still responds.
>
> After another attempt to stop the node on qwe, status freezes on this
> node as well.
>
> In case you need them, I'm attaching a new set of logs from both qwe
> and asd.
>
> --
> Thanks,
> Markiyan.