[rabbitmq-discuss] A three-node cluster hangs completely in ec2

Jon Dokulil jondokulil at gmail.com
Fri Aug 23 00:25:50 BST 2013


@Chris - that is a valid point. AZs are physically separate locations; 
however, Amazon specifically designed them to "feel" like the same 
network. They are located close enough geographically (within something 
like 10-50 miles) to achieve normal latency of <1ms. Network partitions are 
a normal part of life in the AWS world.

Either way - in my case I think the fact that I'm spread across three AZs 
isn't relevant. I have a three-node cluster; I stopped and then started a 
single node in the cluster, then deleted a queue... which caused the entire 
cluster to become unavailable.
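
A minimal sketch (Python) of how the suspicious state from the scenario 
could be spotted before deleting anything: queues whose "node" is not 
running, or queues that show zero mirrors. It assumes queue and node 
listings shaped like the JSON the management plugin returns; the field 
names "name", "node", "slave_nodes" and "running" are assumptions and 
should be checked against the 3.1.5 management API. find_suspect_queues 
is a hypothetical helper, not part of RabbitMQ.

```python
def find_suspect_queues(queues, nodes):
    """Return (queue_name, reason) pairs for queues that look unsafe.

    queues: list of dicts with "name", "node" and optionally "slave_nodes"
            (assumed field names, modelled on the management API output).
    nodes:  list of dicts with "name" and "running" (also assumed).
    """
    # Names of the cluster nodes that currently report as running.
    running = {n["name"] for n in nodes if n.get("running")}

    suspects = []
    for q in queues:
        master = q.get("node")
        mirrors = q.get("slave_nodes") or []
        if master not in running:
            # Step 3 of the scenario: the queue still lists a stopped node.
            suspects.append((q["name"], "master node is not running"))
        elif not mirrors:
            # Step 5 of the scenario: the queue shows zero mirrors.
            suspects.append((q["name"], "queue has zero mirrors"))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data: node A is back online, but one queue
    # that lives on it has lost its mirrors.
    sample_nodes = [
        {"name": "rabbit@A", "running": True},
        {"name": "rabbit@B", "running": True},
        {"name": "rabbit@C", "running": True},
    ]
    sample_queues = [
        {"name": "orders", "node": "rabbit@A", "slave_nodes": []},
        {"name": "emails", "node": "rabbit@B",
         "slave_nodes": ["rabbit@A", "rabbit@C"]},
    ]
    for name, reason in find_suspect_queues(sample_queues, sample_nodes):
        print(name, "->", reason)
```

This only flags the queues; it does not explain or prevent the hang 
itself.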

On Thursday, August 22, 2013 2:10:10 PM UTC-7, Alvaro Videla wrote:
>
> As a comment on Chris's answer: Instagram uses RabbitMQ HA across 
> availability zones: https://twitter.com/rbranson/status/310461932618534913
>
> More details about their setup here: 
> http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
>
>
> On Thu, Aug 22, 2013 at 10:20 PM, Chris <st... at moesel.net> wrote:
>
>> Hi Jon,
>>
>> I'm not 100% familiar with Amazon's availability zones and how they work, 
>> but... it sounds to me like they are in different locations and different 
>> networks?  If so, clustering is probably not a good idea in this case. 
>>  See: http://www.rabbitmq.com/partitions.html
>>
>> I don't know if this is the cause for the issues you've seen, but it may 
>> be the cause of issues in the future...  On the other hand, if I am wrong 
>> about availability zones, then you can safely disregard this message! ;-)
>>
>> -Chris
>>
>>
>>
>> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondo... at gmail.com> wrote:
>>
>>> We've seen this happen twice now and each time it's been a pain to work 
>>> around (we ended up creating a whole new cluster each time). Here's the 
>>> scenario we have seen:
>>>
>>> Our setup:
>>>
>>>    1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each 
>>>    node is in a different availability zone in the US-EAST region on AWS. 
>>>    We'll call them nodes A, B, and C 
>>>    2. Each queue is using an HA policy
>>>    3. All queues are durable
>>>    4. We Basic.Publish with DeliveryMode=2 
>>>    5. All clients are initially connected to node A
>>>
>>> The scenario:
>>>
>>>    1. Node A is shut down (the last time I did it via 'sudo 
>>>    /etc/init.d/rabbitmq-server stop').
>>>    2. All connected clients see the shutdown and successfully 
>>>    transition to using one of the other nodes. About half connect to node B 
>>>    and the other half connect to node C
>>>    3. We notice that a few of the queues still show their "node" as 
>>>    being node A, even though it is not currently running. 
>>>    4. Node A is brought back online. The RabbitMQ management console 
>>>    (webapp) shows everything is fine on the homepage.
>>>    5. When A comes back online, those queues that show A as their 
>>>    'node' now show zero mirrors. 
>>>    6. I attempt to delete the queue via the management webapp. At that 
>>>    point all three nodes become 100% unresponsive. The management webapp fails 
>>>    to respond and all communication in our application stops. CPU fluctuates 
>>>    between 10% and 40%, but memory doesn't seem to be leaking. It's difficult to 
>>>    know what is happening because rabbitmqctl is also unresponsive. Attempts 
>>>    to gracefully stop the nodes all hang. 
>>>
>>> Does anybody have experience with this? What additional information 
>>> should I provide? It's causing a lot of stress and confuses the heck out of 
>>> me. Any guidance is much appreciated.
>>>
>>>
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq... at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>>
>>>
>>
>>
>>
>