[rabbitmq-discuss] One node in a cluster never fully starts up

Thu Jul 26 10:40:11 BST 2012

Hi Matt,

At Wed, 25 Jul 2012 11:48:56 -0700,
Matt Pietrek wrote:
> We have a 3 node cluster (mq1, mq2, mq3) running 2.8.4 supporting a small
> number of HA queues. During startup of the cluster, we start all nodes in
> parallel.

This is not a good idea when dealing with clustering.  RabbitMQ clustering is
basically a thin layer over Mnesia clustering, and we need to do some additional
bookkeeping that is prone to race conditions (e.g. storing the online nodes at
shutdown).  We are working on making this process more reliable on the RabbitMQ
side.

For this reason you should always execute clustering operations sequentially,
i.e. start one node and wait for it to come up before starting the next.
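
As a rough sketch of what "sequentially" means in practice, something along
these lines would do (Python).  Note that start_node and wait_until_up are
hypothetical callbacks, not part of RabbitMQ: you would wire them to whatever
starts the broker on each host and to a readiness check such as the
rabbitmqctl wait you are already using.

```python
from typing import Callable, Iterable, List


def start_sequentially(
    nodes: Iterable[str],
    start_node: Callable[[str], None],
    wait_until_up: Callable[[str], bool],
) -> List[str]:
    """Start cluster nodes one at a time, never in parallel.

    start_node and wait_until_up are hypothetical hooks supplied by the
    caller; this function only enforces the ordering.
    """
    started: List[str] = []
    for node in nodes:
        # Kick off this node only after every previous node is fully up.
        start_node(node)
        # Block until the node reports ready; stop at the first failure
        # rather than starting further nodes against a half-formed cluster.
        if not wait_until_up(node):
            raise RuntimeError(
                f"{node} did not come up; nodes started so far: {started}"
            )
        started.append(node)
    return started
```

The point is just that the next node is not touched until the previous one has
finished starting, so the bookkeeping mentioned above never runs concurrently
on two nodes.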

> Usually everything works fine. However, we've just recently seen one of the
> nodes (mq3) won't start, i.e., the rabbitmqctl wait <pid> doesn't complete.
> 
> I can log in to the management UI on mq1 and mq2, so they're at least
> minimally running.
> 
> Luckily, we've turned on verbose Mnesia logging. Here's what the failing node
> (mq3) shows in the console spew:
>
> [...]
>
> The pattern of "Getting table rabbit_durable_exchange (disc_copies) from node
> rabbit@mq1:" cycles between mq1 and mq2 repeatedly until I kill mq3.

Uhm.  It looks like mnesia is detecting a deadlock, and I'm not sure why.  What
happens if you don't kill it?  Does it terminate by itself, eventually?

> What other sort of information can I provide or look for when this situation
> repeats?

Well, the normal rabbit logs would help.

--
Francesco * Often in error, never in doubt