[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.
Markiyan Kushnir
markiyan.kushnir at gmail.com
Mon May 21 14:41:27 BST 2012
2012/5/21 Markiyan Kushnir <markiyan.kushnir at gmail.com>:
> 2012/5/21 Francesco Mazzoli <francesco at rabbitmq.com>:
>>> Hello Francesco,
>>>
>>> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
>>> minimal test case.
>>>
>>> Test.
>>>
>>> 1. Both qwe and asd are up and running, serving clients.
>>> 2. While qwe (the master) is accepting messages, do stop_app on qwe.
>>> 3. Wait until asd is promoted, then start_app on qwe.
>>> 4. Do stop_app on asd.
>>> 5. Wait, then do start_app on asd. qwe is now the master again.
>>> 6. Repeat steps 2-5. See the "ERROR REPORT" in rabbit@asd.log on asd.
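>>>
>>> Spelled out as shell commands, one cycle of steps 2-5 looks roughly
>>> like this (a sketch, run on the respective hosts):
>>>
>>>     qwe$ sudo /usr/sbin/rabbitmqctl stop_app      # step 2
>>>     # ... wait until asd is promoted ...          # step 3
>>>     qwe$ sudo /usr/sbin/rabbitmqctl start_app
>>>     asd$ sudo /usr/sbin/rabbitmqctl stop_app      # step 4
>>>     # ... wait ...                                # step 5
>>>     asd$ sudo /usr/sbin/rabbitmqctl start_app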
>>
>>
>> I could not reproduce this on my local machine.
>>
>> I don't think it has anything to do with HA queues, as the log indicates
>> `asd' is having problems contacting `qwe' when started. I would expect
>> seeing a timeout error like yours if `asd' was a disc node and `qwe' was
>> down, but if `asd' is a RAM node I'd expect a different error. What version
>> of RabbitMQ are you running? It might be that on earlier versions we had
>> looser checks for standalone RAM nodes.
>>
>> In any case, I would never expect that to happen if at least one node in the
>> cluster is up at all times, which seems to be the case here. Can you
>> check that the error still shows up with those precise steps, making
>> sure that the connection between the two nodes is not severed - and
>> without bothering with HA queues or publishing/consuming?
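>>
>> For instance, a loop along these lines (a sketch; it assumes a shared
>> Erlang cookie so rabbitmqctl can address both nodes via -n from one
>> host) would exercise just the stop/start cycle with no clients
>> attached:
>>
>>     for i in $(seq 10); do
>>         rabbitmqctl -n rabbit@qwe stop_app  && sleep 5
>>         rabbitmqctl -n rabbit@qwe start_app && sleep 5
>>         rabbitmqctl -n rabbit@asd stop_app  && sleep 5
>>         rabbitmqctl -n rabbit@asd start_app && sleep 5
>>         rabbitmqctl -n rabbit@asd cluster_status || break
>>     done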
>>
>>
>>> On your comment on standalone RAM nodes, to be precise, asd (which is
>>> really a RAM node) may be left running standalone for some time while
>>> qwe is doing its stop_app/start_app cycle. In my test, qwe and asd
>>> are never down at the same time.
>>
>>
>> Having a standalone RAM node is a very bad idea, and we actively try to
>> prevent the user from creating that situation when we can. I would strongly
>> advise you against doing that, unless you have good reasons to (but I doubt
>> it).
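>>
>> (A standalone RAM node is easy to spot in cluster_status: the only
>> running node appears under {ram,...}, so no running node holds a disc
>> copy of the schema. A sketch of what that would look like:
>>
>>     asd$ sudo /usr/sbin/rabbitmqctl cluster_status
>>     Cluster status of node 'rabbit@asd' ...
>>     [{nodes,[{disc,['rabbit@qwe']},{ram,['rabbit@asd']}]},
>>      {running_nodes,['rabbit@asd']}]
>>     ...done.
>> )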
>>
>> Francesco.
>
> Re-run my test with two separate modifications:
>
> - without the clients' activity -- the issue didn't show up. After a
> few dozen test runs I couldn't hit it.
>
> - asd changed from RAM to disc (and the clients were busy with their
> communication as in the original test) -- asd went unresponsive after
> a couple of test runs. One difference is that both start_app on asd and
> stop_app on qwe freeze, so I have to CTRL-C them.
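>
> (For reference, the RAM-to-disc change was roughly the standard 2.x
> sequence -- a sketch; with the 2.x cluster command, listing the local
> node itself among the cluster nodes makes it a disc node:
>
>     asd$ sudo /usr/sbin/rabbitmqctl stop_app
>     asd$ sudo /usr/sbin/rabbitmqctl cluster rabbit@qwe rabbit@asd
>     asd$ sudo /usr/sbin/rabbitmqctl start_app
> )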
>
> After the tests (the start_app/stop_app commands hang), cluster_status
> still shows info on both nodes:
>
>
> 13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit@qwe' ...
> [{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
> {running_nodes,['rabbit@asd','rabbit@qwe']}]
> ...done.
>
> 13:12:~$ sudo /usr/sbin/rabbitmqctl cluster_status
> Cluster status of node 'rabbit@asd' ...
> [{nodes,[{disc,['rabbit@asd','rabbit@qwe']}]},
> {running_nodes,['rabbit@qwe','rabbit@asd']}]
> ...done.
>
>
> Here is the status output on both nodes:
>
> 13:18:~$ sudo /usr/sbin/rabbitmqctl status
> Status of node 'rabbit@qwe' ...
> [{pid,6018},
> {running_applications,
> [{rabbitmq_management,"RabbitMQ Management Console","2.8.2"},
> {rabbitmq_management_agent,"RabbitMQ Management Agent","2.8.2"},
> {rabbit,"RabbitMQ","2.8.2"},
> {mnesia,"MNESIA CXC 138 12","4.4.12"},
> {os_mon,"CPO CXC 138 46","2.2.4"},
> {xmerl,"XML parser","1.2.3"},
> {amqp_client,"RabbitMQ AMQP Client","2.8.2"},
> {sasl,"SASL CXC 138 11","2.1.8"},
> {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
> {webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
> {mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
> {inets,"INETS CXC 138 49","5.2"},
> {stdlib,"ERTS CXC 138 10","1.16.4"},
> {kernel,"ERTS CXC 138 10","2.13.4"}]},
> {os,{unix,linux}},
> {erlang_version,
> "Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
> [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,
> [{total,36949080},
> {processes,14130744},
> {processes_used,14117816},
> {system,22818336},
> {atom,1519633},
> {atom_used,1498588},
> {binary,274160},
> {code,18291044},
> {ets,1218344}]},
> {vm_memory_high_watermark,0.3999999994254929},
> {vm_memory_limit,417749401},
> {disk_free_limit,1044373504},
> {disk_free,76397264896},
> {file_descriptors,
> [{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
> {processes,[{limit,1048576},{used,227}]},
> {run_queue,0},
> {uptime,2334}]
> ...done.
>
> 13:27:~$ sudo /usr/sbin/rabbitmqctl status
> Status of node 'rabbit@asd' ...
> [{pid,9804},
> {running_applications,
> [{mnesia,"MNESIA CXC 138 12","4.4.12"},
> {os_mon,"CPO CXC 138 46","2.2.4"},
> {xmerl,"XML parser","1.2.3"},
> {amqp_client,"RabbitMQ AMQP Client","2.8.2"},
> {sasl,"SASL CXC 138 11","2.1.8"},
> {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.8.2"},
> {webmachine,"webmachine","1.7.0-rmq2.8.2-hg"},
> {mochiweb,"MochiMedia Web Server","1.3-rmq2.8.2-git"},
> {inets,"INETS CXC 138 49","5.2"},
> {stdlib,"ERTS CXC 138 10","1.16.4"},
> {kernel,"ERTS CXC 138 10","2.13.4"}]},
> {os,{unix,linux}},
> {erlang_version,
> "Erlang R13B03 (erts-5.7.4) [source] [64-bit] [rq:1]
> [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,
> [{total,33949376},
> {processes,11465800},
> {processes_used,11346720},
> {system,22483576},
> {atom,1519633},
> {atom_used,1498775},
> {binary,94768},
> {code,18291044},
> {ets,1137896}]},
> {file_descriptors,
> [{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
> {processes,[{limit,1048576},{used,105}]},
> {run_queue,0},
> {uptime,2483}]
>
>
>
> Note that in the status output from asd above, `rabbit' itself is
> missing from running_applications, so the broker application is not
> actually running there even though the Erlang node still responds.
>
> After another attempt to stop the node on qwe, status freezes on this
> node as well.
>
> In case you need them, I'm attaching a new set of logs from both qwe
> and asd.
>
> --
> Thanks,
> Markiyan.