[rabbitmq-discuss] A three-node cluster hangs completely in ec2
matthias at rabbitmq.com
Fri Aug 23 00:41:35 BST 2013
On 22/08/13 20:17, Jon Dokulil wrote:
> We've seen this happen twice now and each time it's been a pain to work
> around (we ended up creating a whole new cluster each time). Here's the
> scenario we have seen:
> Our setup:
> 1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
> node is in a different availability zone in the US-EAST region on
> AWS. We'll call them nodes A, B, and C
> 2. Each queue is using an HA policy
> 3. All queues are durable
> 4. We Basic.Publish with DeliveryMode=2
> 5. All clients are initially connected to node A
> The scenario:
> 1. Node A is shutdown (the last time I did it via 'sudo
> /etc/init.d/rabbitmq-server stop
> 2. All connected clients see the shutdown and successfully transition
> to using one of the other nodes. About half connect to node B and
> the other half connect to node C
> 3. We notice that a few of the queues still show their "node" as being
> node A, even though it is not currently running.
> 4. Node A is brought back online. The RabbitMQ management console
> (webapp) shows everything is fine on the homepage.
> 5. When A comes back online, those queues that show A as their 'node'
> now show zero mirrors.
> 6. I attempt to delete the queue via the management webapp. At that
> point all three nodes become 100% unresponsive. The management
> webapp fails to respond and all communication in our application
> stops. CPU fluctuates between 10-40% on but memory doesn't seem to
> be leaking. It's difficult to know what is happening because
> rabbitmqctl is also unresponsive. Attempts to gracefully stop the
> nodes all hang.
Is this reproducible at will?
If so, please grab the RabbitMQ log files (both the normal and the sasl
log) from all nodes, post them and indicate at what time you performed
each of the above steps, so we can correlate that with log entries.
More information about the rabbitmq-discuss