[rabbitmq-discuss] active/active HA cluster appears to be deadlocked

Matthew Sackman matthew at rabbitmq.com
Wed Sep 7 12:05:23 BST 2011


Hi Bryan,

Thanks for reporting all of this.

> > On Tue, Sep 6, 2011 at 3:26 PM, Bryan Murphy <bmurphy1976 at gmail.com> wrote:
> >> I created a 3 server RabbitMQ v2.6.0 cluster with all queues
> >> configured for "x-ha-policy": "all".  Everything seemed to be working
> >> fine, and then I killed one of the servers (halt).  This is in a
> >> virtualized environment (ec2) so that server is gone.  I wanted to
> >> simulate scenarios where one of the servers failed catastrophically.
> >>
> >> The problem is, now my cluster seems to be stuck in some sort of
> >> deadlocked state.  One of my processes *thinks* it posted a message to
> >> the cluster, but I am unable to list_queues to verify this and the
> >> consumer has yet to receive this message.
> >>
> >> "rabbitmqctl list_queues" blocks forever and never returns.

list_queues actively calls into each queue process, and thus requires
each queue process to respond. It sounds to me as though the node that
went away contained the master of the mirrored queue, and none of the
slaves have noticed yet; thus there has been no promotion of a slave,
and list_queues is still trying to call the old, dead master. This
raises two interesting and related questions:

1. Why have the slaves not noticed the loss of the master?
2. Why did the list_queues call not fail? If list_queues tries to call
a queue that doesn't exist, it will error rather than block.

What this seems to suggest is that Erlang itself has got very confused
about what's going on - it seems to think the dead node may still be
alive. Erlang uses a timer (the kernel net_ticktime, 60 seconds by
default) to detect whether a node is dead or alive, so an unresponsive
node should normally be declared down within a minute or two. How long
did the cluster remain in this state? Presumably some time.
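For reference, the tick interval can be tuned in the RabbitMQ
configuration file by setting the kernel application's environment. A
sketch only - the 120-second value below is just an example, not a
recommendation:

```erlang
%% rabbitmq.config -- net_ticktime governs how quickly Erlang declares
%% an unresponsive node down (roughly net_ticktime to net_ticktime *
%% 1.25 seconds of silence). Example value only.
[{kernel, [{net_ticktime, 120}]}].
```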

If you can get this to happen again, the log entries of all the nodes
would be very very useful.

> >> "/etc/init.d/rabbitmq-server stop" blocks forever and never actually
> >> shuts down the server.

Yeah, if list_queues can get blocked, then there's likely some other
process within Rabbit that is calling into the dead node, has similarly
got blocked, and is in turn blocking the shutdown sequence.

> >> "rabbitmqctl cluster_status" returns immediately showing me the three
> >> registered nodes with only two listed as running:
> >>
> >> Cluster status of node 'rabbit@domU-12-31-39-00-E1-D7' ...
> >> [{nodes,[{disc,['rabbit@ip-10-212-138-134','rabbit@ip-10-90-195-244',
> >>                'rabbit@domU-12-31-39-00-E1-D7']}]},
> >>  {running_nodes,['rabbit@ip-10-212-138-134','rabbit@domU-12-31-39-00-E1-D7']}]
> >> ...done.

That's interesting. That shows that mnesia has realised
rabbit@ip-10-90-195-244 has died, even if the rest of Erlang hasn't.
Most curious.

> > I can now run "rabbitmqctl list_queues" on both nodes in the cluster,
> > but the messages are still not flowing and the node I restarted shows
> > the following error when I run "rabbitmqctl list_queues":
> >
> > =ERROR REPORT==== 6-Sep-2011::20:31:35 ===
> > Discarding message
> > {'$gen_call',{<0.256.0>,#Ref<0.0.0.646>},{info,[name,messages]}} from
> > <0.256.0> to <0.1382.0> in an old incarnation (2) of this node (3)

That is an amazing error. Erlang gives every node not only a name but
also an incarnation number (its "creation"), which changes each time a
node is restarted under the same name. What is being suggested here is
that a message was sent from the local node to the local node, but the
destination pid appears to belong to a prior incarnation of this node.

This error seems to come from C code in the source of Erlang and is
invoked on sending a message. I can't get close to explaining exactly
what's going on here - all those pids in that error message are local
pids. There's no remote node involved at all there...

> I've tried to duplicate this and the second time around the cluster is
> behaving like I would have expected.  Unfortunately, I stupidly
> overwrote the old cluster so I can't go back to it for any deeper
> inspection.

Ahh that's a shame. Log entries would have been very very useful. If you
can reproduce this again, please let us know.

> After trawling the archives a bit, it seems that the only way to bring
> a dead node out of the cluster is to create a new node that
> masquerades as the old node.  That seems pretty messy to me.  Am I
> correct about this?  Are there plans to address this in a future
> release?

Yes, you're right about that, and it is messy. Having looked at the code
in this area, there seems to me no reason why we couldn't add that
functionality. I'll raise a bug.
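For anyone needing the masquerade workaround in the meantime, it looks
roughly like this - a hedged sketch only; the node name is taken from
the cluster_status output above, and the exact flags may differ on your
setup:

```shell
# On a spare machine (or the same box), start a throwaway broker that
# impersonates the dead node -- only the node name matters.
RABBITMQ_NODENAME=rabbit@ip-10-90-195-244 rabbitmq-server -detached

# Then take the impostor out of the cluster cleanly; resetting it also
# removes the dead node's entry from the surviving nodes' metadata.
rabbitmqctl -n rabbit@ip-10-90-195-244 stop_app
rabbitmqctl -n rabbit@ip-10-90-195-244 reset
rabbitmqctl -n rabbit@ip-10-90-195-244 stop
```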

Matthew
