[rabbitmq-discuss] A three-node cluster hangs completely in ec2

Thu Aug 22 22:10:10 BST 2013

As a comment on Chris's answer: Instagram uses RabbitMQ HA across
availability zones: https://twitter.com/rbranson/status/310461932618534913

More details about their setup here:
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
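
For the symptom Jon describes in the quoted message below (queues still
listing a stopped node as their "node", and later showing zero mirrors), it
may help to spot the affected queues before touching them. Below is a rough
Python sketch, not a tested tool. It assumes you have already fetched and
parsed the JSON from the management plugin's HTTP API (GET /api/nodes and
GET /api/queues), and that the field names "name", "running", "node" and
"slave_nodes" match your RabbitMQ version. The function name and the sample
data are made up for illustration.

```python
# Sketch only: operates on already-parsed JSON from the RabbitMQ management
# HTTP API. Field names ("name", "running", "node", "slave_nodes") are an
# assumption based on the 3.x management plugin; verify against your version.


def find_suspect_queues(nodes, queues):
    """Return (queue_name, reason) pairs for queues that look unhealthy.

    nodes  -- list of dicts as returned by GET /api/nodes
    queues -- list of dicts as returned by GET /api/queues
    """
    # Names of the cluster nodes that report themselves as running.
    running = {n["name"] for n in nodes if n.get("running")}

    suspects = []
    for q in queues:
        home = q.get("node")
        mirrors = q.get("slave_nodes", [])
        if home not in running:
            # The queue's home node is down or unknown to the cluster.
            suspects.append((q["name"], "home node %s is not running" % home))
        elif not mirrors:
            # The home node is up, but no mirrors are listed for the queue.
            suspects.append((q["name"], "no mirrors listed"))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data standing in for real API responses.
    sample_nodes = [
        {"name": "rabbit@a", "running": False},
        {"name": "rabbit@b", "running": True},
        {"name": "rabbit@c", "running": True},
    ]
    sample_queues = [
        {"name": "q1", "node": "rabbit@a", "slave_nodes": []},
        {"name": "q2", "node": "rabbit@b", "slave_nodes": ["rabbit@c"]},
        {"name": "q3", "node": "rabbit@c", "slave_nodes": []},
    ]
    for name, reason in find_suspect_queues(sample_nodes, sample_queues):
        print("%s: %s" % (name, reason))
```

This only reports what the management API already exposes; it does not
explain or fix the hang itself.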

On Thu, Aug 22, 2013 at 10:20 PM, Chris <stuff at moesel.net> wrote:

> Hi Jon,
>
> I'm not 100% familiar with Amazon's availability zones and how they work,
> but... it sounds to me like they are in different locations and different
> networks?  If so, clustering is probably not a good idea in this case.
>  See: http://www.rabbitmq.com/partitions.html
>
> I don't know if this is the cause for the issues you've seen, but it may
> be the cause of issues in the future...  On the other hand, if I am wrong
> about availability zones, then you can safely disregard this message! ;-)
>
> -Chris
>
>
>
> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondokulil at gmail.com> wrote:
>
>> We've seen this happen twice now and each time it's been a pain to work
>> around (we ended up creating a whole new cluster each time). Here's the
>> scenario we have seen:
>>
>> Our setup:
>>
>>    1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>>    node is in a different availability zone in the US-EAST region on AWS.
>>    We'll call them nodes A, B, and C
>>    2. Each queue is using an HA policy
>>    3. All queues are durable
>>    4. We Basic.Publish with DeliveryMode=2
>>    5. All clients are initially connected to node A
>>
>> The scenario:
>>
>>    1. Node A is shut down (the last time I did it via 'sudo
>>    /etc/init.d/rabbitmq-server stop').
>>    2. All connected clients see the shutdown and successfully transition
>>    to using one of the other nodes. About half connect to node B and the other
>>    half connect to node C
>>    3. We notice that a few of the queues still show their "node" as
>>    being node A, even though it is not currently running.
>>    4. Node A is brought back online. The RabbitMQ management console
>>    (webapp) shows everything is fine on the homepage.
>>    5. When A comes back online, those queues that show A as their 'node'
>>    now show zero mirrors.
>>    6. I attempt to delete the queue via the management webapp. At that
>>    point all three nodes become 100% unresponsive. The management webapp fails
>>    to respond and all communication in our application stops. CPU fluctuates
>>    between 10-40%, but memory doesn't seem to be leaking. It's difficult to
>>    know what is happening because rabbitmqctl is also unresponsive. Attempts
>>    to gracefully stop the nodes all hang.
>>
>> Does anybody have experience with this? What additional information
>> should I provide? It's causing a lot of stress and confuses the heck out of
>> me. Any guidance is much appreciated.
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>>
>