[rabbitmq-discuss] One node in a cluster never fully starts up

Thu Jul 26 16:45:50 BST 2012

Francesco,

Thanks for the quick reply. A couple of replies/questions:

If I'm understanding what you're saying, we should be starting up our
brokers sequentially. However, in my experience this hasn't worked. For
instance, we've seen mq1 stall in its startup, waiting for mq3 to start.
But mq3 can't start (per the sequential logic) till mq1 finishes starting
up. Per advice I received from you previously (below) we've moved to async
startup of the brokers:

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-June/020689.html

> Question 2
> ----------
> Related to the above scenario, is there any danger (after an unplanned
> shutdown), in simply letting all the nodes start in parallel and
> letting Mnesia's waiting sort out the order? It seems to work OK in my
> limited testing so far, but I don't know if we're risking data loss.

It should be fine, but in general it's better to do cluster operations
sequentially and at one site. In this specific case it should be OK.

As it stands now, we're in a catch-22: if we do sequential startup, we run
the risk of deadlocking if we start the nodes in the wrong order. But if we
do async startup, we run into the problem described in this thread.
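
To make the sequential option concrete, here is a minimal sketch of a startup driver. It rests on assumptions that are not from this thread: it uses the general RabbitMQ clustering guidance that the last node to shut down should be the first to start, and it assumes the shutdown order is actually known, which may not hold after an unplanned shutdown. The names `startup_order`, `start_sequentially`, `start_node`, and `wait_for_node` are hypothetical; the two callables would wrap whatever launches a broker and whatever waits for it (for example `rabbitmqctl wait` with an external timeout).

```python
from typing import Callable, List, Sequence


def startup_order(shutdown_order: Sequence[str]) -> List[str]:
    """Return nodes in reverse shutdown order (last node down starts first)."""
    return list(reversed(shutdown_order))


def start_sequentially(
    nodes: Sequence[str],
    start_node: Callable[[str], None],            # hypothetical: launches one broker
    wait_for_node: Callable[[str, float], bool],  # hypothetical: True if node came up in time
    timeout_s: float = 300.0,
) -> List[str]:
    """Start nodes one at a time and stop at the first one that does not come up.

    Returns the nodes that started successfully, so the caller can tell
    which node stalled instead of waiting on it indefinitely.
    """
    started: List[str] = []
    for node in nodes:
        start_node(node)
        if not wait_for_node(node, timeout_s):
            break  # this node stalled; do not start the remaining ones
        started.append(node)
    return started
```

With a bounded wait, a stalled node shows up as a short result list rather than a hung startup script, which at least makes the wrong-order case detectable.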

--------
> Uhm.  It looks like mnesia is detecting a deadlock, and I'm not sure why.
> What happens if you don't kill it?  Does it terminate by itself, eventually?

I've let it wait for a good long time (30 minutes +) before killing it.

Thanks much for your help,

Matt

On Thu, Jul 26, 2012 at 2:40 AM, Francesco Mazzoli
<francesco at rabbitmq.com> wrote:

> Hi Matt,
>
> At Wed, 25 Jul 2012 11:48:56 -0700,
> Matt Pietrek wrote:
> > We have a 3 node cluster (mq1, mq2, mq3) running 2.8.4 supporting a small
> > number of HA queues. During startup of the cluster, we start all nodes in
> > parallel.
>
> This is not a good idea when dealing with clustering.  RabbitMQ clustering
> is basically a thin layer over mnesia clustering, and we need to do some
> additional bookkeeping that is prone to race conditions (e.g. storing the
> online nodes at shutdown).  We are putting efforts in making this process
> more reliable on the rabbit side.
>
> For this reason you should always execute clustering operations
> sequentially.
>
> > Usually everything works fine. However, we've just recently seen one of
> > the nodes (mq3) won't start, i.e., the rabbitmqctl wait <pid> doesn't
> > complete.
> >
> > I can log in to the management UI on mq1 and mq2, so they're at least
> > minimally running.
> >
> > Luckily, we've turned on verbose Mnesia logging. Here's what the failing
> > node (mq3) shows in the console spew:
> >
> > [...]
> >
> > The pattern of "Getting table rabbit_durable_exchange (disc_copies) from
> > node rabbit@mq1:" cycles between mq1 and mq2 repeatedly until I kill mq3.
>
> Uhm.  It looks like mnesia is detecting a deadlock, and I'm not sure why.
> What happens if you don't kill it?  Does it terminate by itself, eventually?
>
> > What other sort of information can I provide or look for when this
> > situation repeats?
>
> Well, the normal rabbit logs would help.
>
> --
> Francesco * Often in error, never in doubt
>