[rabbitmq-discuss] active/active HA cluster appears to be deadlocked

Bryan Murphy bmurphy1976 at gmail.com
Tue Sep 6 21:34:17 BST 2011


Additional info,

I killed the rabbitmq process on one of the servers in the cluster.  I
was then able to run "rabbitmqctl list_queues" on the remaining server
and verify that there are messages in the queue that have not been
picked up.
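
For reference, something along these lines is what I'm running to check
(the extra columns are just standard list_queues info items, adjust as
needed):

  rabbitmqctl list_queues name messages messages_ready messages_unacknowledged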

I restarted the process, and "/etc/init.d/rabbitmq-server start" never
exited.  I had to ^C to get out; however, the server does appear to
have restarted and rejoined the cluster.
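
For what it's worth, "restarted and rejoined" is my reading of the
usual status commands on that node:

  rabbitmqctl status
  rabbitmqctl cluster_status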

I can now run "rabbitmqctl list_queues" on both nodes in the cluster,
but messages are still not flowing, and the node I restarted shows the
following error when I run "rabbitmqctl list_queues":

=ERROR REPORT==== 6-Sep-2011::20:31:35 ===
Discarding message
{'$gen_call',{<0.256.0>,#Ref<0.0.0.646>},{info,[name,messages]}} from
<0.256.0> to <0.1382.0> in an old incarnation (2) of this node (3)
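
My guess (and it is only a guess) is that something is still holding a
pid from the node's previous incarnation, so calls to it get discarded.
If a stale Erlang VM or name registration is part of the problem, a
full stop/start of the node rather than ^C-ing the init script seems
like the next thing to try, roughly:

  rabbitmqctl stop            # stops the whole Erlang node, not just the app
  epmd -names                 # check that nothing stale is still registered
  rabbitmq-server -detached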


On Tue, Sep 6, 2011 at 3:26 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
> I created a 3-server RabbitMQ v2.6.0 cluster with all queues
> configured with "x-ha-policy": "all".  Everything seemed to be working
> fine, and then I killed one of the servers (halt).  This is in a
> virtualized environment (EC2), so that server is gone.  I wanted to
> simulate a scenario where one of the servers fails catastrophically.
>
> The problem is, my cluster now seems to be stuck in some sort of
> deadlocked state.  One of my processes *thinks* it posted a message to
> the cluster, but I am unable to run list_queues to verify this, and
> the consumer has yet to receive the message.
>
> "rabbitmqctl list_queues" blocks forever and never returns.
>
> "/etc/init.d/rabbitmq-server stop" blocks forever and never actually
> shuts down the server.
>
> "rabbitmqctl cluster_status" returns immediately showing me the three
> registered nodes with only two listed as running:
>
> Cluster status of node 'rabbit at domU-12-31-39-00-E1-D7' ...
> [{nodes,[{disc,['rabbit at ip-10-212-138-134','rabbit at ip-10-90-195-244',
>                'rabbit at domU-12-31-39-00-E1-D7']}]},
>  {running_nodes,['rabbit at ip-10-212-138-134','rabbit at domU-12-31-39-00-E1-D7']}]
> ...done.
>
> As you can see, all three nodes were configured as disc nodes, and
> rabbit at ip-10-90-195-244 is no longer running.
>
> So, somehow the cluster seems to be deadlocked.  Since this server
> cannot possibly be restored, how do I get out of this state?  Is there
> a way I can forcefully tell the rest of the cluster to forget about
> the missing server?  I can't find an example of how to do this in the
> documentation.
>
> Thanks,
> Bryan
>

