[rabbitmq-discuss] Cluster unresponsive after some stop/start of a node.

Mon May 21 12:17:43 BST 2012

> Hello Francesco,
>
> I'm attaching full logs from qwe (qwe.tgz) and asd (asd.tgz) of a
> minimal test case.
>
> Test.
>
> 1. Both qwe and asd are up and running, serving clients.
> 2. While qwe (master) is accepting messages, do stop_app on qwe.
> 3. Wait until asd is promoted, then start_app on qwe.
> 4. Do stop_app on asd.
> 5. Wait, do start_app on asd. qwe is now the master again.
> 6. Repeat 2,3,4,5 again. See "ERROR REPORT" (rabbit@asd.log) on asd.

I could not reproduce this on my local machine.

I don't think it has anything to do with HA queues, as the log indicates
`asd' is having problems contacting `qwe' when started. I would expect
to see a timeout error like yours if `asd' was a disc node and `qwe' was
down, but if `asd' is a RAM node I'd expect a different error. What
version of RabbitMQ are you running? It might be that on earlier
versions we had looser checks for standalone RAM nodes.

In any case, I would never expect that to happen if at least one node in
the cluster is up at all times, which seems to be the case here. Can you
confirm that the error shows up with those precise steps, making sure
that the connection between the two nodes is not severed, and without
bothering with HA queues and publishing/consuming?
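
For what it's worth, steps 2-5 could be scripted roughly as below. This
is only a sketch: the node names rabbit@qwe and rabbit@asd are assumed
from the log file name, cycle_commands and run are made-up helper names,
and the waits (for `asd' to be promoted, and so on) are not automated,
so by default it only prints the rabbitmqctl commands to issue by hand.

```python
import subprocess

# Assumed node names; adjust to whatever your nodes are actually called.
FIRST = "rabbit@qwe"   # the node that starts out as master
SECOND = "rabbit@asd"  # the RAM node


def cycle_commands(rounds=2):
    """Return the rabbitmqctl invocations for steps 2-5, `rounds` times."""
    cmds = []
    for _ in range(rounds):
        for node in (FIRST, SECOND):
            # stop_app, then (after waiting by hand) start_app on the same
            # node, so that the other node is up the whole time.
            cmds.append(["rabbitmqctl", "-n", node, "stop_app"])
            cmds.append(["rabbitmqctl", "-n", node, "start_app"])
    return cmds


def run(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # No waiting is done between commands; add pauses as needed.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(cycle_commands())
```

Running it as is touches nothing on the brokers; it just lists the
commands in order, which makes it easy to compare against what was
actually run when the error appeared.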

> On your comment on standalone RAM nodes, to be precise, asd (which is
> really a RAM node) may be left running standalone for some time while
> qwe is doing stop_app/start_app cycle. In my test, both qwe and asd
> are never down at the same time.

Having a standalone RAM node is a very bad idea, and we actively try to 
prevent the user from creating that situation when we can. I would 
strongly advise you against doing that, unless you have good reasons to 
(but I doubt it).

Francesco.