[rabbitmq-discuss] One node in a cluster never fully starts up

Fri May 24 00:34:05 BST 2013

Hi Matt!

I haven’t looked at RabbitMQ for a while, so my answers might be outdated and/or
wrong, but since I gave the original answers I’m going to give it a shot :).

> I know that sequentially is the ideal situation, but in the case of an
> uncontrolled shutdown, it doesn't reliably start, as the first node may
> time out waiting for a later node.
> 
> When this happens, starting the nodes in parallel gets past the problem.
> However, my understanding is that this has its own risks. (See my original
> first message in this thread.)
> 
> At the end of the day, we just need a set of scripts that will idempotently
> start/stop the cluster reliably. It's infeasible to expect an operator
> (i.e. not me) to assess the current cluster state and then guess which
> approach to take.
> 
> Has the guidance changed between 2.8.4 and 3.1.1? I know it's basically a
> mnesia issue - I just don't know what improvements have been made since
> 2.8.4.

The main improvement is that the operations are safer now.  It goes more or
less like this: when a node starts up it asks the other nodes (‘other nodes’ in
this case being the ones in the config file) which cluster they are in, and
then joins that cluster if the other nodes are consistent with it in terms of
Erlang and RabbitMQ versions.

This opens up a race condition if the nodes start together: during the startup
sequence they all ask each other for the cluster, but since they are not yet
fully started, all (or some) of them might get inconclusive replies.  Thus the
cluster might not form fully.  There surely are other negative outcomes that I
am forgetting, but since RabbitMQ 3 it should be fairly hard (or at least
harder) to get catastrophic results (e.g. mnesia panics).  This is because in
the past RabbitMQ blindly clustered with the other nodes without really
bothering to check if it made sense to do so.
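
To make the race a bit more concrete, here is a toy model in Python of the
check as I described it above.  It is not RabbitMQ's actual code: the Reply
type, the decide function and the Erlang version strings are all made up for
illustration, and a real node may well stop with an error on a version
mismatch instead of just skipping that peer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reply:
    """What a peer reports (hypothetical type, for illustration only)."""
    erlang: str
    rabbit: str
    # None models an inconclusive reply from a peer that is still booting.
    cluster: Optional[frozenset]


def decide(my_erlang, my_rabbit, replies):
    """Return the cluster to join, or None if no peer gave a usable answer."""
    for reply in replies:
        if reply.cluster is None:
            continue  # inconclusive: the peer has not finished starting
        if (reply.erlang, reply.rabbit) != (my_erlang, my_rabbit):
            continue  # versions differ: do not cluster blindly
        return reply.cluster
    return None


if __name__ == "__main__":
    booting = Reply("R16B", "3.1.1", None)
    # Every peer is still starting, so nobody learns which cluster to join.
    print(decide("R16B", "3.1.1", [booting, booting]))
```

When all the nodes boot at once every reply can be inconclusive, which is how
you end up with a cluster that has only partly formed.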

The situation is a tricky one to solve with an interface similar to the one that
we have now, and within the constraints of mnesia.  You can probably devise an
ad-hoc solution if your assumptions are stronger than the ones we make in the
code, e.g. you might check after a while whether the nodes are really clustered
and take action if they are not.
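
A minimal sketch of such a check, in Python.  It assumes that `rabbitmqctl
cluster_status` prints an Erlang term containing `{running_nodes,[...]}`, as
the 3.x releases do as far as I remember, and that you know in advance which
nodes should be in the cluster.  The function names, the retry counts and the
node names in the usage note are made up; which action to take when nodes are
missing is up to you.

```python
import re
import subprocess
import time


def running_nodes(status_text):
    """Extract the running_nodes list from cluster_status output."""
    match = re.search(r"\{running_nodes,\s*\[([^\]]*)\]", status_text)
    if match is None:
        return set()
    # Node names may be wrapped across lines and quoted, e.g. 'rabbit@my-host'.
    names = (n.strip().strip("'") for n in match.group(1).split(","))
    return {n for n in names if n}


def fetch_status():
    """Run rabbitmqctl against the local node and return its raw output."""
    return subprocess.run(
        ["rabbitmqctl", "cluster_status"],
        capture_output=True, text=True, check=True,
    ).stdout


def wait_for_cluster(expected, fetch=fetch_status, attempts=10, delay=5.0):
    """Poll until every expected node is running; return the missing ones."""
    missing = set(expected)
    for attempt in range(attempts):
        try:
            missing = set(expected) - running_nodes(fetch())
        except (OSError, subprocess.CalledProcessError):
            missing = set(expected)  # local node not answering yet
        if not missing:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return missing
```

A start script could call `wait_for_cluster({"rabbit@host1", "rabbit@host2"})`
once the nodes have been started and alert (or retry the start) if the
returned set is not empty.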

Hopefully the above was helpful; you might want to wait for others for more
precise answers :).

Francesco