[rabbitmq-discuss] A three-node cluster hangs completely in ec2
Liam Reilly
liam.reilly.1 at gmail.com
Fri Aug 23 09:39:30 BST 2013
That's a great article!
On Thursday, 22 August 2013 22:10:10 UTC+1, Alvaro Videla wrote:
>
> As a comment on Chris's answer: Instagram uses RabbitMQ HA across
> availability zones: https://twitter.com/rbranson/status/310461932618534913
>
> More details about their setup here:
> http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
>
>
> On Thu, Aug 22, 2013 at 10:20 PM, Chris <st... at moesel.net> wrote:
>
>> Hi Jon,
>>
>> I'm not 100% familiar with Amazon's availability zones and how they work,
>> but... it sounds to me like they are in different locations and different
>> networks? If so, clustering is probably not a good idea in this case.
>> See: http://www.rabbitmq.com/partitions.html
>>
>> I don't know if this is the cause for the issues you've seen, but it may
>> be the cause of issues in the future... On the other hand, if I am wrong
>> about availability zones, then you can safely disregard this message! ;-)
>>
>> -Chris
>>
>>
>>
>> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondo... at gmail.com> wrote:
>>
>>> We've seen this happen twice now and each time it's been a pain to work
>>> around (we ended up creating a whole new cluster each time). Here's the
>>> scenario we have seen:
>>>
>>> Our setup:
>>>
>>> 1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>>> node is in a different availability zone in the US-EAST region on AWS.
>>> We'll call them nodes A, B, and C.
>>> 2. Each queue is using an HA policy
>>> 3. All queues are durable
>>> 4. We Basic.Publish with DeliveryMode=2
>>> 5. All clients are initially connected to node A
>>>
>>> The scenario:
>>>
>>> 1. Node A is shut down (the last time I did it via 'sudo
>>> /etc/init.d/rabbitmq-server stop').
>>> 2. All connected clients see the shutdown and successfully
>>> transition to using one of the other nodes. About half connect to node B
>>> and the other half connect to node C
>>> 3. We notice that a few of the queues still show their "node" as
>>> being node A, even though it is not currently running.
>>> 4. Node A is brought back online. The RabbitMQ management console
>>> (webapp) shows everything is fine on the homepage.
>>> 5. When A comes back online, those queues that show A as their
>>> 'node' now show zero mirrors.
>>> 6. I attempt to delete the queue via the management webapp. At that
>>> point all three nodes become 100% unresponsive. The management webapp fails
>>> to respond and all communication in our application stops. CPU fluctuates
>>> between 10-40%, but memory doesn't seem to be leaking. It's difficult to
>>> know what is happening because rabbitmqctl is also unresponsive. Attempts
>>> to gracefully stop the nodes all hang.
>>>
>>> Does anybody have experience with this? What additional information
>>> should I provide? It's causing a lot of stress and confuses the heck out of
>>> me. Any guidance is much appreciated.
>>>
>>>
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq... at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>>
>>>
>>
>
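
For anyone trying to spot the state Jon describes in steps 3 and 5 before touching the queues, here is a minimal sketch in Python. It assumes you have already fetched and JSON-decoded the management plugin's queue and node listings, and that each queue entry carries "node" and "slave_nodes" fields and each node entry carries "name" and "running". Those field names are my assumption about the management HTTP API output, so check them against your version. The node and queue names in the sample data are made up for illustration.

```python
def find_suspect_queues(queues, nodes):
    """Flag queues whose master node is down or that have no mirrors.

    `queues` and `nodes` are lists of dicts, assumed to be shaped like the
    management API's queue and node listings (field names are an assumption).
    Returns a list of (vhost, queue_name, reason) tuples.
    """
    # Names of the nodes that currently report themselves as running.
    running = {n["name"] for n in nodes if n.get("running")}

    suspects = []
    for q in queues:
        master = q.get("node")
        mirrors = q.get("slave_nodes") or []
        if master not in running:
            # Step 3 of the report: the queue still points at a stopped node.
            reason = "master node %s is not running" % master
            suspects.append((q.get("vhost"), q["name"], reason))
        elif not mirrors:
            # Step 5 of the report: the queue is reachable but has zero mirrors.
            suspects.append((q.get("vhost"), q["name"], "no mirrors"))
    return suspects


# Hypothetical sample data: node A is stopped, B and C are up.
SAMPLE_NODES = [
    {"name": "rabbit@A", "running": False},
    {"name": "rabbit@B", "running": True},
    {"name": "rabbit@C", "running": True},
]
SAMPLE_QUEUES = [
    {"vhost": "/", "name": "orders", "node": "rabbit@B", "slave_nodes": ["rabbit@C"]},
    {"vhost": "/", "name": "emails", "node": "rabbit@A", "slave_nodes": []},
    {"vhost": "/", "name": "audit", "node": "rabbit@C", "slave_nodes": []},
]

if __name__ == "__main__":
    for vhost, name, reason in find_suspect_queues(SAMPLE_QUEUES, SAMPLE_NODES):
        print("%s %s: %s" % (vhost, name, reason))
```

Since every queue in Jon's setup uses an HA policy, a queue with no mirrors is unexpected there. On a cluster that also has unmirrored queues, the second check would flag those too and would need to be filtered.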