<div dir="ltr">Hi Jon,<div><br></div><div>I&#39;m not 100% familiar with Amazon&#39;s availability zones and how they work, but... it sounds to me like they are in different locations and different networks?  If so, clustering is probably not a good idea in this case.  See: <a href="http://www.rabbitmq.com/partitions.html">http://www.rabbitmq.com/partitions.html</a></div>

<div><br></div><div>I don&#39;t know if this is the cause of the issues you&#39;ve seen, but it may cause issues in the future... On the other hand, if I am wrong about availability zones, then you can safely disregard this message! ;-)</div>

<div><br></div><div>-Chris</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <span dir="ltr">&lt;<a href="mailto:jondokulil@gmail.com" target="_blank">jondokulil@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">We&#39;ve seen this happen twice now and each time it&#39;s been a pain to work around (we ended up creating a whole new cluster each time). Here&#39;s the scenario we have seen:<div>

<br></div><div>Our setup:</div><div><ol><li><span style="line-height:normal">Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each node is in a different availability zone in the US-EAST region on AWS. We&#39;ll call them nodes A, B, and C</span></li>

<li><span style="line-height:normal">Each queue is using an HA policy</span></li><li><span style="line-height:normal">All queues are durable</span></li><li><span style="line-height:normal">We Basic.Publish with DeliveryMode=2</span></li>

<li><span style="line-height:normal">All clients are initially connected to node A</span></li></ol><div>The scenario:</div></div><div><ol><li><span style="line-height:normal">Node A is shut down (the last time I did it via &#39;sudo /etc/init.d/rabbitmq-server stop&#39;)</span></li>

<li><span style="line-height:normal">All connected clients see the shutdown and successfully transition to using one of the other nodes. About half connect to node B and the other half connect to node C</span></li><li><span style="line-height:normal">We notice that a few of the queues still show their &quot;node&quot; as being node A, even though it is not currently running.</span></li>

<li><span style="line-height:normal">Node A is brought back online. The RabbitMQ management console (webapp) shows everything is fine on the homepage.</span></li><li><span style="line-height:normal">When A comes back online, those queues that show A as their &#39;node&#39; now show zero mirrors.</span></li>

<li><span style="line-height:normal">I attempt to delete the queue via the management webapp. At that point all three nodes become 100% unresponsive. The management webapp fails to respond and all communication in our application stops. CPU fluctuates between 10% and 40%, but memory doesn&#39;t seem to be leaking. It&#39;s difficult to know what is happening because rabbitmqctl is also unresponsive. Attempts to gracefully stop the nodes all hang.</span></li>

</ol><div>Does anybody have experience with this? What additional information should I provide? It&#39;s causing a lot of stress and confuses the heck out of me. Any guidance is much appreciated.</div></div><div><br></div>

</div><br>

<br></blockquote></div><br></div>