[rabbitmq-discuss] A three-node cluster hangs completely in ec2

Fri Aug 23 00:26:30 BST 2013

@Alvaro - thanks for the link, that's an interesting article.

On Thursday, August 22, 2013 2:10:10 PM UTC-7, Alvaro Videla wrote:
>
> As a comment to Chris's answer: Instagram uses RabbitMQ HA across 
> availability zones: https://twitter.com/rbranson/status/310461932618534913
>
> More details about their setup here: 
> http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
>
>
> On Thu, Aug 22, 2013 at 10:20 PM, Chris <st... at moesel.net> wrote:
>
>> Hi Jon,
>>
>> I'm not 100% familiar with Amazon's availability zones and how they work, 
>> but... it sounds to me like they are in different locations and different 
>> networks?  If so, clustering is probably not a good idea in this case. 
>>  See: http://www.rabbitmq.com/partitions.html
>>
>> I don't know if this is the cause for the issues you've seen, but it may 
>> be the cause of issues in the future...  On the other hand, if I am wrong 
>> about availability zones, then you can safely disregard this message! ;-)
>>
>> -Chris
>>
>>
>>
>> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondo... at gmail.com> wrote:
>>
>>> We've seen this happen twice now and each time it's been a pain to work 
>>> around (we ended up creating a whole new cluster each time). Here's the 
>>> scenario we have seen:
>>>
>>> Our setup:
>>>
>>>    1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each 
>>>    node is in a different availability zone in the US-EAST region on AWS. 
>>>    We'll call them nodes A, B, and C 
>>>    2. Each queue is using an HA policy
>>>    3. All queues are durable
>>>    4. We Basic.Publish with DeliveryMode=2 
>>>    5. All clients are initially connected to node A
>>>
>>> The scenario:
>>>
>>>    1. Node A is shut down (the last time I did it via 'sudo 
>>>    /etc/init.d/rabbitmq-server stop').
>>>    2. All connected clients see the shutdown and successfully 
>>>    transition to using one of the other nodes. About half connect to node B 
>>>    and the other half connect to node C
>>>    3. We notice that a few of the queues still show their "node" as 
>>>    being node A, even though it is not currently running. 
>>>    4. Node A is brought back online. The RabbitMQ management console 
>>>    (webapp) shows everything is fine on the homepage.
>>>    5. When A comes back online, those queues that show A as their 
>>>    'node' now show zero mirrors. 
>>>    6. I attempt to delete the queue via the management webapp. At that 
>>>    point all three nodes become 100% unresponsive. The management webapp fails 
>>>    to respond and all communication in our application stops. CPU fluctuates 
>>>    between 10-40% but memory doesn't seem to be leaking. It's difficult to 
>>>    know what is happening because rabbitmqctl is also unresponsive. Attempts 
>>>    to gracefully stop the nodes all hang. 
>>>
>>> Does anybody have experience with this? What additional information 
>>> should I provide? It's causing a lot of stress and confuses the heck out of 
>>> me. Any guidance is much appreciated.
>>>
>>>
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq... at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>>
>>>
>>
>
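
The client failover described in step 2 of the quoted scenario (clients
moving from node A to node B or C once A is stopped) can be sketched
roughly as below. This is only an illustration, not the poster's actual
client code: the host names, the connect_with_failover helper, and the
plain TCP connect are all assumptions made for the example, and a real
client would hand the socket step off to its AMQP library.

```python
import socket

# Hypothetical host names standing in for nodes A, B, and C from the thread.
NODES = [
    "rabbit-a.example.internal",
    "rabbit-b.example.internal",
    "rabbit-c.example.internal",
]
AMQP_PORT = 5672  # default AMQP port


def tcp_connect(host, port=AMQP_PORT, timeout=5.0):
    """Open a plain TCP connection to a node (stand-in for a real AMQP connect)."""
    return socket.create_connection((host, port), timeout=timeout)


def connect_with_failover(nodes, connect=tcp_connect):
    """Try each node in order; return (host, connection) for the first that answers."""
    errors = {}
    for host in nodes:
        try:
            return host, connect(host)
        except OSError as exc:
            errors[host] = exc  # remember why this node was skipped
    raise ConnectionError(f"no node reachable: {errors}")
```

Note that a successful TCP connect only shows the port is open. It would
not by itself detect the state in step 6, where the nodes stay up but
stop responding.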