[rabbitmq-discuss] A three-node cluster hangs completely in ec2
jondokulil at gmail.com
Fri Aug 23 00:26:30 BST 2013
@Alvaro - thanks for the link, that's an interesting article.
On Thursday, August 22, 2013 2:10:10 PM UTC-7, Alvaro Videla wrote:
> As a comment on Chris's answer: Instagram uses RabbitMQ HA across
> availability zones: https://twitter.com/rbranson/status/310461932618534913
> More details about their setup here:
>> Hi Jon,
>> I'm not 100% familiar with Amazon's availability zones and how they work,
>> but... it sounds to me like they are in different locations and different
>> networks? If so, clustering is probably not a good idea in this case.
>> See: http://www.rabbitmq.com/partitions.html
>> I don't know if this is the cause for the issues you've seen, but it may
>> be the cause of issues in the future... On the other hand, if I am wrong
>> about availability zones, then you can safely disregard this message! ;-)
>> > wrote:
>>> We've seen this happen twice now and each time it's been a pain to work
>>> around (we ended up creating a whole new cluster each time). Here's the
>>> scenario we have seen:
>>> Our setup:
>>> 1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>>> node is in a different availability zone in the US-EAST region on AWS.
>>> We'll call them nodes A, B, and C.
>>> 2. Each queue is using an HA policy
>>> 3. All queues are durable
>>> 4. We Basic.Publish with DeliveryMode=2
>>> 5. All clients are initially connected to node A
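To make the setup above concrete, here is a rough sketch of the three settings involved: the HA policy, durable queue declaration, and persistent publishing. It only builds the command line and argument dictionaries and does not talk to a broker. The policy name "ha-all", the pattern "^", and the mode "all" are assumptions for illustration (the original message does not say which HA mode is in use), and the helper names are hypothetical.

```python
import json

# AMQP 0-9-1 delivery-mode values: 1 = transient, 2 = persistent.
TRANSIENT = 1
PERSISTENT = 2


def ha_policy_command(name, pattern, mode="all"):
    """Build a `rabbitmqctl set_policy` argument list that mirrors matching queues.

    The mode "all" is an assumption; the thread does not state which one is used.
    """
    definition = json.dumps({"ha-mode": mode}, separators=(",", ":"))
    return ["rabbitmqctl", "set_policy", name, pattern, definition]


def queue_declare_args(queue_name):
    """Arguments for a durable queue (survives a broker restart)."""
    return {"queue": queue_name, "durable": True}


def publish_properties(persistent=True):
    """Basic.Publish properties; DeliveryMode=2 marks the message persistent."""
    return {"delivery_mode": PERSISTENT if persistent else TRANSIENT}


# Hypothetical policy name and pattern, matching every queue on the vhost.
command = ha_policy_command("ha-all", "^")
print(command)
print(queue_declare_args("orders"))
print(publish_properties())
```

The argument list can be handed to `subprocess.run` on a broker host, and the two dictionaries map onto the queue-declare and publish calls of whichever client library is in use.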
>>> The scenario:
>>> 1. Node A is shut down (the last time I did it via 'sudo
>>> /etc/init.d/rabbitmq-server stop').
>>> 2. All connected clients see the shutdown and successfully
>>> transition to using one of the other nodes. About half connect to node B
>>> and the other half connect to node C
>>> 3. We notice that a few of the queues still show their "node" as
>>> being node A, even though it is not currently running.
>>> 4. Node A is brought back online. The RabbitMQ management console
>>> (webapp) shows everything is fine on the homepage.
>>> 5. When A comes back online, those queues that show A as their
>>> 'node' now show zero mirrors.
>>> 6. I attempt to delete the queue via the management webapp. At that
>>> point all three nodes become 100% unresponsive. The management webapp fails
>>> to respond and all communication in our application stops. CPU fluctuates
>>> between 10-40%, but memory doesn't seem to be leaking. It's difficult to
>>> know what is happening because rabbitmqctl is also unresponsive. Attempts
>>> to gracefully stop the nodes all hang.
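Steps 3 and 5 above describe two symptoms that can be checked before touching a queue: the queue's "node" points at a node that is not running, or the queue reports zero mirrors. A minimal sketch of that check follows. It assumes each queue is a dictionary with "name", "node", and "slave_nodes" keys, which is my understanding of what the management plugin's queue listing returned in this era; treat the field names as an assumption. The sample data and node names are made up.

```python
def find_suspect_queues(queues, running_nodes):
    """Return (name, master, reasons) for queues showing either symptom.

    `queues` is a list of dicts with assumed keys "name", "node" and
    "slave_nodes"; `running_nodes` is the list of nodes currently up.
    """
    running = set(running_nodes)
    suspects = []
    for queue in queues:
        master = queue.get("node")
        mirrors = queue.get("slave_nodes") or []
        reasons = []
        if master not in running:
            reasons.append("master node not running")
        if not mirrors:
            reasons.append("no mirrors")
        if reasons:
            suspects.append((queue["name"], master, reasons))
    return suspects


# Hypothetical snapshot taken while node A is stopped (step 3).
sample_queues = [
    {"name": "orders", "node": "rabbit@A", "slave_nodes": []},
    {"name": "events", "node": "rabbit@B", "slave_nodes": ["rabbit@C"]},
]

while_a_down = find_suspect_queues(sample_queues, ["rabbit@B", "rabbit@C"])
after_a_back = find_suspect_queues(
    sample_queues, ["rabbit@A", "rabbit@B", "rabbit@C"]
)
print(while_a_down)
print(after_a_back)
```

This only flags the queues; it does not explain or fix the hang, and it reads from a snapshot rather than from a live node.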
>>> Does anybody have experience with this? What additional information
>>> should I provide? It's causing a lot of stress and confuses the heck out of
>>> me. Any guidance is much appreciated.
>>> rabbitmq-discuss mailing list