[rabbitmq-discuss] Cluster unresponsive after several stop/start cycles of a node.
Markiyan Kushnir
markiyan.kushnir at gmail.com
Mon May 21 14:37:16 BST 2012
2012/5/21 Francesco Mazzoli <francesco at rabbitmq.com>:
>> Hello Francesco,
>>
>> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
>> minimal test case.
>>
>> Test.
>>
>> 1. Both qwe and asd are up and running, serving clients.
>> 2. While qwe (the master) is accepting messages, do stop_app on qwe.
>> 3. Wait until asd is promoted, then start_app on qwe.
>> 4. Do stop_app on asd.
>> 5. Wait, then do start_app on asd. qwe is now the master again.
>> 6. Repeat steps 2-5. See the "ERROR REPORT" in rabbit@asd.log on asd.
>
>
> I could not reproduce this on my local machine.
>
> I don't think it has anything to do with HA queues, as the log indicates
> `asd' is having problems contacting `qwe' when started. I would expect
> to see a timeout error like yours if `asd' was a disc node and `qwe' was
> down, but if `asd' is a RAM node I'd expect a different error. What version
> of RabbitMQ are you running? It might be that on earlier versions we had
> looser checks for standalone RAM nodes.
>
> In any case, I would never expect that to happen if at least one node in the
> cluster is up at all times, which seems to be the case here. Can you
> confirm that the error shows up with those precise steps, making sure
> that the connection between the two nodes is not severed, and without
> involving HA queues or publishing/consuming?
>
>
>> On your comment on standalone RAM nodes, to be precise: asd (which
>> really is a RAM node) may be left running standalone for some time
>> while qwe goes through a stop_app/start_app cycle. In my test, qwe
>> and asd are never down at the same time.
>
>
> Having a standalone RAM node is a very bad idea, and we actively try to
> prevent the user from creating that situation when we can. I would strongly
> advise you against doing that, unless you have good reasons to (but I doubt
> it).
>
> Francesco.
I re-ran my test with two separate modifications:
- without the clients' activity -- the issue didn't show up at first;
only after a dozen or so test runs could I hit it.
- asd changed from RAM to disc (and the clients busy with their
communication as in the original test) -- asd went unresponsive after a
couple of test runs. One difference is that both start_app on asd and
stop_app on qwe freeze, so that I have to Ctrl-C them.
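For reference, here is roughly the driver loop behind steps 2-5 above (a
sketch only -- the passwordless ssh setup, the host names matching the
node names, and the 10-second waits are assumptions about my setup, not
the exact commands):

#!/bin/sh
# One failover cycle per iteration (steps 2-5); assumes passwordless ssh
# to both hosts and that the clients keep publishing/consuming throughout.
for run in 1 2; do
    ssh qwe sudo /usr/sbin/rabbitmqctl stop_app   # step 2: stop the master
    sleep 10                                      # step 3: let asd take over
    ssh qwe sudo /usr/sbin/rabbitmqctl start_app
    ssh asd sudo /usr/sbin/rabbitmqctl stop_app   # step 4
    sleep 10                                      # step 5: qwe is master again
    ssh asd sudo /usr/sbin/rabbitmqctl start_app
done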
After the tests (with the start_app/stop_app commands still hung),
cluster_status still shows both nodes:
13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@qwe' ...
[{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
 {running_nodes,['rabbit@asd','rabbit@qwe']}]
...done.
13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@asd' ...
[{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
 {running_nodes,['rabbit@qwe','rabbit@asd']}]
...done.
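The first cluster_status above was run on qwe and the second on asd, as
the output headers show. For what it's worth, the -n flag can make the
target node explicit from either box:

sudo /usr/sbin/rabbitmqctl -n rabbit@asd cluster_status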
Here is the status output on both nodes:
13:18:~$ sudo /usr/sbin/rabbitmqctl status
Status of node 'rabbit@qwe' ...
[{pid,6018},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","2.8.2"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","2.8.2"},
{rabbit,"RabbitMQ","2.8.2"},
{mnesia,"MNESIA CXC 138 12","4.4.12"},
{os_mon,"CPO CXC 138 46","2.2.4"},
{xmerl,"XML parser","1.2.3"},
{amqp_client,"RabbitMQ AMQP Client","2.8.2"},
{sasl,"SASL CXC 138 11","2.1.8"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
{webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
{inets,"INETS CXC 138 49","5.2"},
{stdlib,"ERTS CXC 138 10","1.16.4"},
{kernel,"ERTS CXC 138 10","2.13.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,
[{total,36949080},
{processes,14130744},
{processes_used,14117816},
{system,22818336},
{atom,1519633},
{atom_used,1498588},
{binary,274160},
{code,18291044},
{ets,1218344}]},
{vm_memory_high_watermark,0.3999999994254929},
{vm_memory_limit,417749401},
{disk_free_limit,1044373504},
{disk_free,76397264896},
{file_descriptors,
[{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
{processes,[{limit,1048576},{used,227}]},
{run_queue,0},
{uptime,2334}]
...done.
13:27:~$ sudo /usr/sbin/rabbitmqctl status
Status of node 'rabbit@asd' ...
[{pid,9804},
{running_applications,
[{mnesia,"MNESIA CXC 138 12","4.4.12"},
{os_mon,"CPO CXC 138 46","2.2.4"},
{xmerl,"XML parser","1.2.3"},
{amqp_client,"RabbitMQ AMQP Client","2.8.2"},
{sasl,"SASL CXC 138 11","2.1.8"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
{webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
{inets,"INETS CXC 138 49","5.2"},
{stdlib,"ERTS CXC 138 10","1.16.4"},
{kernel,"ERTS CXC 138 10","2.13.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,
[{total,33949376},
{processes,11465800},
{processes_used,11346720},
{system,22483576},
{atom,1519633},
{atom_used,1498775},
{binary,94768},
{code,18291044},
{ets,1137896}]},
{file_descriptors,
[{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,105}]},
{run_queue,0},
{uptime,2483}]
After another attempt to stop the node on qwe, status froze on that node
as well. Note that asd's running_applications list above does not include
the rabbit application itself, so its Erlang VM is up but the broker
apparently never finished starting.
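To keep the hung controls from locking up my terminal, I have started
wrapping rabbitmqctl in coreutils timeout(1) (a sketch; the 10-second
cutoff is an arbitrary choice of mine):

# Give up on a hung control call instead of freezing the shell.
sudo timeout 10 /usr/sbin/rabbitmqctl status \
    || echo "rabbitmqctl status timed out"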
In case you need them, I'm attaching a new set of logs from both qwe and asd.
--
Thanks,
Markiyan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: asd-2.tgz
Type: application/x-gzip
Size: 3225 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/db00bd39/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qwe-2.tgz
Type: application/x-gzip
Size: 3627 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/db00bd39/attachment-0001.bin>