[rabbitmq-discuss] A three-node cluster hangs completely in ec2

Thu Aug 22 21:20:58 BST 2013

Hi Jon,

I'm not 100% familiar with Amazon's availability zones and how they work,
but... it sounds to me like they are in different locations and different
networks?  If so, clustering is probably not a good idea in this case.
See: http://www.rabbitmq.com/partitions.html

I don't know if this is the cause for the issues you've seen, but it may be
the cause of issues in the future...  On the other hand, if I am wrong
about availability zones, then you can safely disregard this message! ;-)
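
If you want to rule a partition in or out, one option is to look at the
partitions entry that "rabbitmqctl cluster_status" prints. Below is a rough
Python sketch of that check, not a tested tool. It assumes the Erlang-term
style output where a healthy cluster shows "{partitions,[]}"; the function
names has_partitions and cluster_status and the 30-second timeout are my own
choices for illustration.

```python
import re
import subprocess

# Matches an empty partitions entry such as "{partitions,[]}".
# Assumption: the status text uses this Erlang-term layout.
_EMPTY_PARTITIONS = re.compile(r"\{partitions,\s*\[\s*\]\s*\}")


def has_partitions(status_text):
    """Return True if the status text reports a non-empty partitions list.

    Returns False when the partitions entry is empty or absent.
    """
    if "{partitions," not in status_text:
        return False
    return _EMPTY_PARTITIONS.search(status_text) is None


def cluster_status(timeout_seconds=30):
    """Run `rabbitmqctl cluster_status` and return its stdout as text.

    The timeout is there because rabbitmqctl itself may hang; in that case
    subprocess.TimeoutExpired is raised instead of blocking forever.
    """
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_seconds,
    )
    return result.stdout
```

Running has_partitions(cluster_status()) on each node would tell you whether
that node believes it has been partitioned from the others. An empty list
does not prove the network is fine, only that no partition was recorded.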

-Chris

On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <jondokulil at gmail.com> wrote:

> We've seen this happen twice now and each time it's been a pain to work
> around (we ended up creating a whole new cluster each time). Here's the
> scenario we have seen:
>
> Our setup:
>
>    1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>    node is in a different availability zone in the US-EAST region on AWS.
>    We'll call them nodes A, B, and C
>    2. Each queue is using an HA policy
>    3. All queues are durable
>    4. We Basic.Publish with DeliveryMode=2
>    5. All clients are initially connected to node A
>
> The scenario:
>
>    1. Node A is shut down (the last time I did it via 'sudo
>    /etc/init.d/rabbitmq-server stop').
>    2. All connected clients see the shutdown and successfully transition
>    to using one of the other nodes. About half connect to node B and the other
>    half connect to node C
>    3. We notice that a few of the queues still show their "node" as being
>    node A, even though it is not currently running.
>    4. Node A is brought back online. The RabbitMQ management console
>    (webapp) shows everything is fine on the homepage.
>    5. When A comes back online, those queues that show A as their 'node'
>    now show zero mirrors.
>    6. I attempt to delete the queue via the management webapp. At that
>    point all three nodes become 100% unresponsive. The management webapp fails
>    to respond and all communication in our application stops. CPU fluctuates
>    between 10-40%, but memory doesn't seem to be leaking. It's difficult to
>    know what is happening because rabbitmqctl is also unresponsive. Attempts
>    to gracefully stop the nodes all hang.
>
> Does anybody have experience with this? What additional information should
> I provide? It's causing a lot of stress and confuses the heck out of me.
> Any guidance is much appreciated.