[rabbitmq-discuss] A three-node cluster hangs completely in ec2
Alvaro Videla
videlalvaro at gmail.com
Thu Aug 22 22:10:10 BST 2013
As a comment on Chris's answer: Instagram uses RabbitMQ HA across
availability zones: https://twitter.com/rbranson/status/310461932618534913
More details about their setup here:
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
On Thu, Aug 22, 2013 at 10:20 PM, Chris <stuff at moesel.net> wrote:
> Hi Jon,
>
> I'm not 100% familiar with Amazon's availability zones and how they work,
> but... it sounds to me like they are in different locations and different
> networks? If so, clustering is probably not a good idea in this case.
> See: http://www.rabbitmq.com/partitions.html
>
> I don't know if this is the cause for the issues you've seen, but it may
> be the cause of issues in the future... On the other hand, if I am wrong
> about availability zones, then you can safely disregard this message! ;-)
>
> -Chris
>
>
>
> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondokulil at gmail.com> wrote:
>
>> We've seen this happen twice now and each time it's been a pain to work
>> around (we ended up creating a whole new cluster each time). Here's the
>> scenario we have seen:
>>
>> Our setup:
>>
>> 1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>> node is in a different availability zone in the US-EAST region on AWS.
>> We'll call them nodes A, B, and C
>> 2. Each queue is using an HA policy
>> 3. All queues are durable
>> 4. We Basic.Publish with DeliveryMode=2
>> 5. All clients are initially connected to node A
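
For anyone following along, here is a rough sketch of what item 2 might look like on the command line. Jon's actual policy isn't shown in the thread, so the policy name "ha-all", the catch-all pattern "^" and the "ha-mode": "all" definition are assumptions chosen for illustration. The helper function name is hypothetical too; it only builds the `rabbitmqctl set_policy` command string and does not run it.

```python
import json
import shlex


def ha_policy_command(name="ha-all", pattern="^", mode="all"):
    """Build a `rabbitmqctl set_policy` command line for a mirrored-queue policy.

    The defaults are illustrative assumptions, not the policy from the thread.
    """
    # The policy definition is passed to rabbitmqctl as a JSON string.
    definition = json.dumps({"ha-mode": mode})
    argv = ["rabbitmqctl", "set_policy", name, pattern, definition]
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(ha_policy_command())
```

Persistence (items 3 and 4) is separate from mirroring: the queue has to be declared durable and each message published with DeliveryMode=2, whatever the policy says.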
>>
>> The scenario:
>>
>> 1. Node A is shut down (the last time I did it via 'sudo
>> /etc/init.d/rabbitmq-server stop')
>> 2. All connected clients see the shutdown and successfully transition
>> to using one of the other nodes. About half connect to node B and the other
>> half connect to node C
>> 3. We notice that a few of the queues still show their "node" as
>> being node A, even though it is not currently running.
>> 4. Node A is brought back online. The RabbitMQ management console
>> (webapp) shows everything is fine on the homepage.
>> 5. When A comes back online, those queues that show A as their 'node'
>> now show zero mirrors.
>> 6. I attempt to delete the queue via the management webapp. At that
>> point all three nodes become 100% unresponsive. The management webapp fails
>> to respond and all communication in our application stops. CPU fluctuates
>> between 10-40% but memory doesn't seem to be leaking. It's difficult to
>> know what is happening because rabbitmqctl is also unresponsive. Attempts
>> to gracefully stop the nodes all hang.
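
The symptoms in steps 3 and 5 (a queue still homed on a stopped node, or a queue with zero mirrors) could be checked for before anyone tries to delete anything. Below is a minimal sketch, assuming queue records shaped like the ones the management plugin returns, with "name", "node" and "slave_nodes" fields. Treat those field names, the node names and the function name as assumptions for illustration. The sample data is made up, not taken from Jon's cluster.

```python
def find_suspect_queues(queues, running_nodes):
    """Return (queue name, reasons) pairs for queues that look unhealthy.

    A queue is flagged when its home node is not running (step 3)
    or when it reports no mirrors at all (step 5).
    """
    suspects = []
    for queue in queues:
        reasons = []
        if queue["node"] not in running_nodes:
            reasons.append("home node not running")
        if not queue.get("slave_nodes"):
            reasons.append("no mirrors")
        if reasons:
            suspects.append((queue["name"], reasons))
    return suspects


if __name__ == "__main__":
    # Hypothetical snapshot taken while node A is stopped.
    sample = [
        {"name": "orders", "node": "rabbit@B", "slave_nodes": ["rabbit@C"]},
        {"name": "emails", "node": "rabbit@A", "slave_nodes": []},
    ]
    for name, reasons in find_suspect_queues(sample, {"rabbit@B", "rabbit@C"}):
        print(name, "->", ", ".join(reasons))
```

A report like this at least records which queues were in the odd state before the cluster stopped responding, which is useful information to post back to the list.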
>>
>> Does anybody have experience with this? What additional information
>> should I provide? It's causing a lot of stress and confuses the heck out of
>> me. Any guidance is much appreciated.
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>>