<div dir="ltr">Hi Jon,<div><br></div><div>I'm not 100% familiar with Amazon's availability zones and how they work, but... it sounds to me like they are in different locations and different networks? If so, clustering is probably not a good idea in this case. See: <a href="http://www.rabbitmq.com/partitions.html">http://www.rabbitmq.com/partitions.html</a></div>
I don't know if this is the cause of the issues you've seen, but it may be the cause of issues in the future... On the other hand, if I am wrong about availability zones, then you can safely disregard this message! ;-)
-Chris

On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondokulil@gmail.com> wrote:
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">We've seen this happen twice now and each time it's been a pain to work around (we ended up creating a whole new cluster each time). Here's the scenario we have seen:<div>
Our setup:

1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each node is in a different availability zone in the US-EAST region on AWS. We'll call them nodes A, B, and C.
2. Each queue is using an HA policy (roughly the kind of policy shown in the sketch after this list).
3. All queues are durable.
4. We Basic.Publish with DeliveryMode=2.
5. All clients are initially connected to node A.
<li><span style="line-height:normal">All connected clients see the shutdown and successfully transition to using one of the other nodes. About half connect to node B and the other half connect to node C</span></li><li><span style="line-height:normal">We notice that a few of the queues still show their "node" as being node A, even though it is not currently running.</span></li>
<li><span style="line-height:normal">Node A is brought back online. The RabbitMQ management console (webapp) shows everything is fine on the homepage.</span></li><li><span style="line-height:normal">When A comes back online, those queues that show A as their 'node' now show zero mirrors.</span></li>
<li><span style="line-height:normal">I attempt to delete the queue via the management webapp. At that point all three nodes become 100% unresponsive. The management webapp fails to respond and all communication in our application stops. CPU fluctuates between 10-40% on but memory doesn't seem to be leaking. It's difficult to know what is happening because rabbitmqctl is also unresponsive. Attempts to gracefully stop the nodes all hang.</span></li>
Does anybody have experience with this? What additional information should I provide? It's causing a lot of stress and confuses the heck out of me. Any guidance is much appreciated.
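If it would help, next time this happens I can try to capture the mirror state before everything wedges, along these lines (assuming rabbitmqctl is still answering at that point):

    # where each queue's master and mirrors currently live
    rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids

    # what each node thinks the cluster membership is
    rabbitmqctl cluster_status

    # full server dump to attach to a report
    rabbitmqctl report > rabbitmq-report.txt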