[rabbitmq-discuss] Cluster unresponsive after several stop/start cycles of a node.
Markiyan Kushnir
markiyan.kushnir at gmail.com
Mon May 21 14:37:16 BST 2012
2012/5/21 Francesco Mazzoli <francesco at rabbitmq.com>:
>> Hello Francesco,
>>
>> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
>> minimal test case.
>>
>> Test.
>>
>> 1. Both qwe and asd are up and running, serving clients.
>> 2. While qwe (the master) is accepting messages, do stop_app on qwe.
>> 3. Wait until asd is promoted, then start_app on qwe.
>> 4. Do stop_app on asd.
>> 5. Wait, then do start_app on asd. qwe is now the master again.
>> 6. Repeat steps 2-5. See the "ERROR REPORT" in rabbit@asd.log on asd.
>
>
> I could not reproduce this on my local machine.
>
> I don't think it has anything to do with HA queues, as the log indicates
> `asd' is having problems contacting `qwe' when started. I would expect
> to see a timeout error like yours if `asd' was a disc node and `qwe' was
> down, but if `asd' is a RAM node I'd expect a different error. What version
> of RabbitMQ are you running? It might be that on earlier versions we had
> looser checks for standalone RAM nodes.
>
> In any case, I would never expect that to happen if at least one node in the
> cluster is up at all times, which seems to be the case here. Can you
> confirm that the error shows up with those precise steps, making sure
> that the connection between the two nodes is not severed, and without
> involving HA queues or publishing/consuming?
>
>
>> On your comment on standalone RAM nodes, to be precise: asd (which
>> really is a RAM node) may be left running standalone for some time
>> while qwe goes through a stop_app/start_app cycle. In my test, qwe
>> and asd are never down at the same time.
>
>
> Having a standalone RAM node is a very bad idea, and we actively try to
> prevent the user from creating that situation when we can. I would strongly
> advise you against doing that, unless you have good reasons to (but I doubt
> it).
>
> Francesco.
I re-ran my test with two separate modifications:
- without the clients' activity -- the issue didn't show up at first;
only after a dozen or so test runs could I hit it.
- asd changed from RAM to disc (and the clients busy with their
communication as in the original test) -- asd went unresponsive after a
couple of test runs. One difference is that both start_app on asd and
stop_app on qwe freeze, so that I have to Ctrl-C them.
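For reference, here is roughly the driver loop behind steps 2-5 above (a
sketch only -- the passwordless ssh setup, the host names matching the
node names, and the 10-second waits are assumptions about my setup, not
the exact commands):

#!/bin/sh
# One failover cycle per iteration (steps 2-5); assumes passwordless ssh
# to both hosts and that the clients keep publishing/consuming throughout.
for run in 1 2; do
    ssh qwe sudo /usr/sbin/rabbitmqctl stop_app   # step 2: stop the master
    sleep 10                                      # step 3: let asd take over
    ssh qwe sudo /usr/sbin/rabbitmqctl start_app
    ssh asd sudo /usr/sbin/rabbitmqctl stop_app   # step 4
    sleep 10                                      # step 5: qwe is master again
    ssh asd sudo /usr/sbin/rabbitmqctl start_app
done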
After the tests (with the start_app/stop_app commands still hung),
cluster_status still shows both nodes:
13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@qwe' ...
[{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
 {running_nodes,['rabbit@asd','rabbit@qwe']}]
...done.
13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
Cluster status of node 'rabbit@asd' ...
[{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
 {running_nodes,['rabbit@qwe','rabbit@asd']}]
...done.
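The first cluster_status above was run on qwe and the second on asd, as
the output headers show. For what it's worth, the -n flag can make the
target node explicit from either box:

sudo /usr/sbin/rabbitmqctl -n rabbit@asd cluster_status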
Here is the status output on both nodes:
13:18:~$ sudo /usr/sbin/rabbitmqctl status
Status of node 'rabbit@qwe' ...
[{pid,6018},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","2.8.2"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","2.8.2"},
{rabbit,"RabbitMQ","2.8.2"},
{mnesia,"MNESIA CXC 138 12","4.4.12"},
{os_mon,"CPO CXC 138 46","2.2.4"},
{xmerl,"XML parser","1.2.3"},
{amqp_client,"RabbitMQ AMQP Client","2.8.2"},
{sasl,"SASL CXC 138 11","2.1.8"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
{webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
{inets,"INETS CXC 138 49","5.2"},
{stdlib,"ERTS CXC 138 10","1.16.4"},
{kernel,"ERTS CXC 138 10","2.13.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,
[{total,36949080},
{processes,14130744},
{processes_used,14117816},
{system,22818336},
{atom,1519633},
{atom_used,1498588},
{binary,274160},
{code,18291044},
{ets,1218344}]},
{vm_memory_high_watermark,0.3999999994254929},
{vm_memory_limit,417749401},
{disk_free_limit,1044373504},
{disk_free,76397264896},
{file_descriptors,
[{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
{processes,[{limit,1048576},{used,227}]},
{run_queue,0},
{uptime,2334}]
...done.
13:27:~$ sudo /usr/sbin/rabbitmqctl status
Status of node 'rabbit@asd' ...
[{pid,9804},
{running_applications,
[{mnesia,"MNESIA CXC 138 12","4.4.12"},
{os_mon,"CPO CXC 138 46","2.2.4"},
{xmerl,"XML parser","1.2.3"},
{amqp_client,"RabbitMQ AMQP Client","2.8.2"},
{sasl,"SASL CXC 138 11","2.1.8"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
{webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
{inets,"INETS CXC 138 49","5.2"},
{stdlib,"ERTS CXC 138 10","1.16.4"},
{kernel,"ERTS CXC 138 10","2.13.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,
[{total,33949376},
{processes,11465800},
{processes_used,11346720},
{system,22483576},
{atom,1519633},
{atom_used,1498775},
{binary,94768},
{code,18291044},
{ets,1137896}]},
{file_descriptors,
[{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,105}]},
{run_queue,0},
{uptime,2483}]
After another attempt to stop the node on qwe, status froze on that node
as well. Note that asd's running_applications list above does not include
the rabbit application itself, so its Erlang VM is up but the broker
apparently never finished starting.
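To keep the hung controls from locking up my terminal, I have started
wrapping rabbitmqctl in coreutils timeout(1) (a sketch; the 10-second
cutoff is an arbitrary choice of mine):

# Give up on a hung control call instead of freezing the shell.
sudo timeout 10 /usr/sbin/rabbitmqctl status \
    || echo "rabbitmqctl status timed out"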
In case you need them, I'm attaching a new set of logs from both qwe and asd.
--
Thanks,
Markiyan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: asd-2.tgz
Type: application/x-gzip
Size: 3225 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/db00bd39/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qwe-2.tgz
Type: application/x-gzip
Size: 3627 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20120521/db00bd39/attachment-0001.bin>