[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.
francesco at rabbitmq.com
Mon May 21 12:17:43 BST 2012
> Hello Francesco,
> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
> minimal test case.
> 1. Both qwe and asd are up and running, serving clients.
> 2. While qwe (master) is accepting messages, do stop_app on qwe.
> 3. Wait until asd is promoted, then start_app on qwe.
> 4. Do stop_app on asd.
> 5. Wait, then do start_app on asd. qwe is now the master again.
> 6. Repeat steps 2-5. See "ERROR REPORT" (rabbit@asd.log) on asd.
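For reference, steps 2-5 above could be scripted roughly as follows. This is only a sketch under assumptions: the node names rabbit@qwe and rabbit@asd are inferred from the log file name, the fixed pause is a hypothetical stand-in for "wait until promoted", and rabbitmqctl is assumed to be on the PATH of a host that shares the Erlang cookie with both nodes.

```python
import subprocess
import time


def cycle_commands(first="rabbit@qwe", second="rabbit@asd", rounds=2):
    """Build the rabbitmqctl invocations for steps 2-5, repeated `rounds` times.

    Node names are assumptions; adjust them to your cluster.
    """
    commands = []
    for _ in range(rounds):
        for node in (first, second):
            # stop_app stops the RabbitMQ application but leaves the Erlang VM up
            commands.append(["rabbitmqctl", "-n", node, "stop_app"])
            commands.append(["rabbitmqctl", "-n", node, "start_app"])
    return commands


def run_cycle(pause_seconds=30, **kwargs):
    """Run the commands in order, pausing after each one.

    The fixed pause is a crude placeholder for waiting until the other
    node has taken over; it does not actually check for promotion.
    """
    for cmd in cycle_commands(**kwargs):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        time.sleep(pause_seconds)
```

Calling run_cycle() with no arguments would go through the stop/start sequence twice, which matches step 6. Only one node is stopped at any point in this order, so the other one stays up throughout.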
I could not reproduce this on my local machine.
I don't think it has anything to do with HA queues, as the log indicates
`asd' is having problems contacting `qwe' when started. I would expect
to see a timeout error like yours if `asd' was a disc node and `qwe' was
down, but if `asd' is a RAM node I'd expect a different error. What
version of RabbitMQ are you running? It might be that on earlier
versions we had looser checks for standalone RAM nodes.
In any case, I would never expect that to happen if at least one node in
the cluster is up at all times, which seems to be the case here. Can you
check that the error shows up with those precise steps, making sure
that the connection between the two nodes is not severed, and without
bothering with HA queues and publishing/consuming?
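One way to check that the two nodes still see each other between steps is to look at what `rabbitmqctl cluster_status' reports. The sketch below assumes the output contains an Erlang term of the form {running_nodes,[...]}; please verify that against what your version actually prints. The function names and the default node names are hypothetical.

```python
import re


def running_nodes(cluster_status_output):
    """Extract node names from a {running_nodes,[...]} term.

    The term layout is an assumption about rabbitmqctl's output,
    not something guaranteed for every release.
    """
    match = re.search(r"\{running_nodes,\s*\[([^\]]*)\]\}", cluster_status_output)
    if match is None:
        return set()
    # Erlang quotes atoms with single quotes when they contain e.g. hyphens
    return {
        name.strip().strip("'")
        for name in match.group(1).split(",")
        if name.strip()
    }


def all_connected(cluster_status_output, expected=("rabbit@qwe", "rabbit@asd")):
    """Return True if every expected node is listed as running."""
    return set(expected) <= running_nodes(cluster_status_output)
```

Feeding it the text printed by `rabbitmqctl -n rabbit@qwe cluster_status' before each stop_app would tell you whether the other node was visible at that moment.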
> On your comment on standalone RAM nodes, to be precise, asd (which is
> really a RAM node) may be left running standalone for some time while
> qwe is doing stop_app/start_app cycle. In my test, both qwe and asd
> are never down at the same time.
Having a standalone RAM node is a very bad idea, and we actively try to
prevent the user from creating that situation when we can. I would
strongly advise you against doing that, unless you have good reasons to
(but I doubt it).