Francesco,<br><br>Thanks for the quick reply. A couple of replies/questions:<br><br>If I&#39;m understanding you correctly, we should be starting up our brokers sequentially. However, in my experience this hasn&#39;t worked. For instance, we&#39;ve seen mq1 stall in its startup, waiting for mq3 to start. But mq3 can&#39;t start (per the sequential logic) until mq1 finishes starting up. Per advice I received from you previously (below), we&#39;ve moved to async startup of the brokers:<br>

<br><a href="http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-June/020689.html">http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-June/020689.html</a><br><br><pre>&gt;<i> Question 2

</i>&gt;<i> ---------------

</i>&gt;<i> Related to the above scenario, is there any danger (after an unplanned

</i>&gt;<i> shutdown), in simply letting all the nodes start in parallel and

</i>&gt;<i> letting Mnesia&#39;s waiting sort out the order? It seems to work OK in my

</i>&gt;<i> limited testing so far, but I don&#39;t know if we&#39;re risking data loss.

</i>

It should be fine, but in general it&#39;s better to do cluster operations

sequentially and at one site. In this specific case it should be OK.<br><br></pre>As it stands now, we&#39;re in a catch-22: if we do sequential startup, we risk deadlocking if we start the nodes in the wrong order. But if we do async startup, we run into the problem described in this thread.<br>
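<br>A minimal Python sketch of the ordering problem, for illustration only. It assumes a simplified model that is not taken from the RabbitMQ source: on restart, a node blocks until at least one of the nodes that were still running when it shut down is up again, so the last node to stop is the only one that can start alone. The function names are hypothetical.<br>
<pre>
```python
# Simplified model (an assumption, not RabbitMQ's actual logic): a restarting
# node waits until at least one node that outlived it is running again.


def running_at_shutdown(shutdown_order):
    """Map each node to the set of nodes still running when it went down."""
    return {
        node: set(shutdown_order[i + 1:])
        for i, node in enumerate(shutdown_order)
    }


def first_blocked_node(shutdown_order, start_order):
    """Return the first node that would stall in a strictly sequential
    startup (each node must finish starting before the next begins),
    or None if the whole sequence completes."""
    waits_for = running_at_shutdown(shutdown_order)
    started = set()
    for node in start_order:
        # A node with survivors to wait for stalls if none of them is up yet.
        if waits_for[node] and waits_for[node].isdisjoint(started):
            return node
        started.add(node)
    return None


def safe_start_order(shutdown_order):
    """Start the last node to stop first, i.e. reverse the shutdown order."""
    return list(reversed(shutdown_order))


shutdown = ["mq1", "mq2", "mq3"]  # mq3 was the last node to stop
print(first_blocked_node(shutdown, ["mq1", "mq2", "mq3"]))        # mq1
print(first_blocked_node(shutdown, safe_start_order(shutdown)))   # None
```
</pre>
Under this model, a sequential startup only avoids the stall when the shutdown order is known, which is exactly what is missing after an unplanned shutdown.<br>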

<br>--------<br>&gt; Uhm.  It looks like mnesia is detecting a deadlock, and I&#39;m not sure why.  What<br>

&gt; happens if you don&#39;t kill it?  Does it terminate by itself, eventually?<br><br>I&#39;ve let it wait for a good long time (30+ minutes) before killing it.<br><br>Thanks much for your help,<br><br>Matt<br><br><div class="gmail_quote">

On Thu, Jul 26, 2012 at 2:40 AM, Francesco Mazzoli <span dir="ltr">&lt;<a href="mailto:francesco@rabbitmq.com" target="_blank">francesco@rabbitmq.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Matt,<br>

<br>

At Wed, 25 Jul 2012 11:48:56 -0700,<br>

<div class="im">Matt Pietrek wrote:<br>

&gt; We have a 3 node cluster (mq1, mq2, mq3) running 2.8.4 supporting a small<br>

&gt; number of HA queues. During startup of the cluster, we start all nodes in<br>

&gt; parallel.<br>

<br>

</div>This is not a good idea when dealing with clustering.  RabbitMQ clustering is<br>

basically a thin layer over mnesia clustering, and we need to do some additional<br>

bookkeeping that is prone to race conditions (e.g. storing the online nodes at<br>

shutdown).  We are working on making this process more reliable on the<br>

rabbit side.<br>

<br>

For this reason you should always execute clustering operations sequentially.<br>

<div class="im"><br>

&gt; Usually everything works fine. However, we&#39;ve just recently seen one of the<br>

&gt; nodes (mq3) won&#39;t start, i.e., the rabbitmqctl wait &lt;pid&gt; doesn&#39;t complete.<br>

&gt;<br>

&gt; I can log in to the management UI on mq1 and mq2, so they&#39;re at least<br>

&gt; minimally running.<br>

&gt;<br>

&gt; Luckily, we&#39;ve turned on verbose Mnesia logging. Here&#39;s what the failing node<br>

&gt; (mq3) shows in the console spew:<br>

&gt;<br>

</div>&gt; [...]<br>

<div class="im">&gt;<br>

&gt; The pattern of &quot;Getting table rabbit_durable_exchange (disc_copies) from node<br>

&gt; rabbit@mq1:&quot; cycles between mq1 and mq2 repeatedly until I kill mq3.<br>

<br>

</div>Uhm.  It looks like mnesia is detecting a deadlock, and I&#39;m not sure why.  What<br>

happens if you don&#39;t kill it?  Does it terminate by itself, eventually?<br>

<div class="im"><br>

&gt; What other sort of information can I provide or look for when this situation<br>

&gt; repeats?<br>

<br>

</div>Well, the normal rabbit logs would help.<br>

<br>

--<br>

Francesco * Often in error, never in doubt<br>

</blockquote></div><br>