[rabbitmq-discuss] active/active HA cluster appears to be deadlocked

Tue Sep 6 21:26:54 BST 2011

I created a 3-server RabbitMQ v2.6.0 cluster with all queues
configured for "x-ha-policy": "all".  Everything seemed to be working
fine, and then I killed one of the servers (halt).  This is in a
virtualized environment (EC2), so that server is gone.  I wanted to
simulate a scenario where one of the servers fails catastrophically.

The problem is that my cluster now seems to be stuck in some sort of
deadlocked state.  One of my processes *thinks* it posted a message to
the cluster, but I am unable to run list_queues to verify this, and the
consumer has yet to receive the message.

"rabbitmqctl list_queues" blocks forever and never returns.

"/etc/init.d/rabbitmq-server stop" blocks forever and never actually
shuts down the server.
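
For anyone trying to reproduce this without tying up a terminal, here
is a minimal sketch of how a hanging command can be detected from a
script.  The helper name run_with_timeout and the 30 second limit are
my own choices for illustration, not anything provided by RabbitMQ; it
only wraps an arbitrary command with a deadline.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd and wait at most `seconds` for it to exit.

    Returns (finished, stdout). finished is False when the command was
    still running at the deadline, in which case it is killed and the
    returned output is empty.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=seconds
        )
    except subprocess.TimeoutExpired:
        # The command never returned; treat it as hung.
        return False, ""
    return True, result.stdout
```

Calling run_with_timeout(["rabbitmqctl", "list_queues"], 30) on one of
the surviving nodes would then report finished as False instead of
blocking forever.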

"rabbitmqctl cluster_status" returns immediately showing me the three
registered nodes with only two listed as running:

Cluster status of node 'rabbit@domU-12-31-39-00-E1-D7' ...
[{nodes,[{disc,['rabbit@ip-10-212-138-134','rabbit@ip-10-90-195-244',
                'rabbit@domU-12-31-39-00-E1-D7']}]},
 {running_nodes,['rabbit@ip-10-212-138-134','rabbit@domU-12-31-39-00-E1-D7']}]
...done.

As you can see, all three nodes were configured as "disk" nodes, and
rabbit@ip-10-90-195-244 is no longer running.
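
To make the comparison between the two lists explicit, here is a rough
sketch that extracts the registered nodes that are not running from
cluster_status output.  It assumes the output has the same layout as
the one pasted above (a nodes entry followed by a running_nodes entry,
with node names in single quotes in name@host form); the function name
down_nodes is hypothetical and the parsing is a simple text match, not
a real Erlang term parser.

```python
import re

SAMPLE = """Cluster status of node 'rabbit@domU-12-31-39-00-E1-D7' ...
[{nodes,[{disc,['rabbit@ip-10-212-138-134','rabbit@ip-10-90-195-244',
                'rabbit@domU-12-31-39-00-E1-D7']}]},
 {running_nodes,['rabbit@ip-10-212-138-134','rabbit@domU-12-31-39-00-E1-D7']}]
...done.
"""


def down_nodes(status_text):
    """Return registered nodes that do not appear under running_nodes."""
    start = status_text.find("[{nodes")
    split = status_text.find("{running_nodes")
    if start == -1 or split == -1:
        raise ValueError("unrecognized cluster_status output")
    # Quoted names between the two markers are the registered nodes;
    # quoted names after the second marker are the running ones.
    registered = re.findall(r"'([^']+)'", status_text[start:split])
    running = set(re.findall(r"'([^']+)'", status_text[split:]))
    return [node for node in registered if node not in running]


print(down_nodes(SAMPLE))  # ['rabbit@ip-10-90-195-244']
```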

So, somehow the cluster seems to be deadlocked.  Since the halted
server cannot possibly be restored, how do I get out of this state?  Is
there a way I can forcefully tell the rest of the cluster to forget
about the missing server?  I can't find an example of how to do this in
the documentation.

Thanks,
Bryan