[rabbitmq-discuss] Cluster nodes stop/start order can lead to failures

Thu Sep 13 09:28:55 BST 2012

On 12/09/2012 9:43PM, Matt Long wrote:
> Say I have node1 and node2 both running as disc nodes in a cluster
> (there are no other nodes in the cluster). If I stop rabbitmq-server on
> node1 and then stop rabbitmq-server on node2, I'm unable to then start
> rabbitmq-server again on node1...in particular, the start command hangs
> for ~35 seconds before showing FAILED...

That's correct - it's waiting 30s for node2 to become available.

> Is this the expected behavior?

Yes. The point is that node2 could have been running for a long time 
while node1 was down; all sorts of things could have happened. node1 
doesn't know. If node1 were to start and then allow node2 to join later, 
we could end up in a situation where the nodes have diverged and need to 
be merged manually... not good.

> Note that starting node2 after having
> stopped node1 and then node2 works fine; I'm assuming because node2 was
> aware that node1 had gone offline prior to its stopping.

Yes.

So the answer is either to start the nodes in the reverse of the order
in which they were stopped (last stopped, first started), or to start
them all simultaneously (so that the 30s wait period is enough).
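
As a rough sketch of the first option, here is the ordering rule in
Python. The helper names are hypothetical (they are not part of RabbitMQ)
and the code only works out the order; it does not start anything.

```python
def start_order(stop_order):
    """Return the nodes in the order they should be started.

    stop_order lists node names in the order they were stopped,
    earliest first. The last node stopped must be started first.
    """
    return list(reversed(stop_order))


def can_start_alone(node, stop_order):
    """True if node was the last one stopped.

    Only that node can start without waiting for a peer; any other
    node will wait for the nodes that were stopped after it.
    """
    return bool(stop_order) and stop_order[-1] == node


if __name__ == "__main__":
    stopped = ["node1", "node2"]  # node1 was stopped first, then node2
    print(start_order(stopped))
    print(can_start_alone("node1", stopped))
```

With the scenario above (node1 stopped first, then node2), this gives
node2 as the node to start first, and reports that node1 cannot start
on its own.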

Cheers, Simon