[rabbitmq-discuss] HA - missing or incompletely replicated queues

Mon Nov 7 13:50:14 GMT 2011

Hi Ash,

On Mon, Nov 07, 2011 at 01:27:11PM +0000, Ashley Brown wrote:
> There are 3 nodes in the cluster, with queues replicated to all nodes. In
> testing, I've been issuing kill commands to take out beam, rabbit and
> erlang processes, which close the channels and make clients reconnect to a
> different node. I allow the downed node to recover and come back up before
> killing another one (also allowing queues to synchronize).

You're aware that there is no eager synchronisation of HA queues, yes?
So it's only by the unsynchronised head of each queue being consumed
that synchronisation occurs.
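
To illustrate the point, here is a toy model in Python of a master with one
slave that joined late. It is only a sketch of the behaviour described above,
not of Rabbit's actual implementation; the class and method names are made up
for the example.

```python
from collections import deque


class MirroredQueue:
    """Toy model: a master queue plus one slave that joined late (hypothetical names)."""

    def __init__(self, backlog):
        # Messages already on the master when the slave joins.
        self.master = deque(backlog)
        # The slave starts empty: nothing is copied to it eagerly.
        self.slave = deque()
        # Number of messages at the head of the master that the slave lacks.
        self.unsynced_head = len(self.master)

    def publish(self, msg):
        # New publishes reach both the master and the slave.
        self.master.append(msg)
        self.slave.append(msg)

    def consume(self):
        msg = self.master.popleft()
        if self.unsynced_head > 0:
            # This message predates the slave, so only the master held it.
            self.unsynced_head -= 1
        else:
            # The slave holds a copy as well; drop it.
            self.slave.popleft()
        return msg

    def synchronised(self):
        # The slave is synchronised once the old head has been consumed.
        return self.unsynced_head == 0


q = MirroredQueue(["m1", "m2"])   # two messages predate the slave
q.publish("m3")                   # replicated to both
print(q.synchronised())           # False
q.consume()
q.consume()
print(q.synchronised())           # True
```

Once the two old messages are consumed, master and slave hold the same
contents, which is all "synchronised" means here.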

> After doing this a couple of times, we see the following:
> 
> http://www.evernote.com/shard/s53/sh/448a967e-b995-4f54-986d-50194955550f/416d4e71af91f6e2f6d5311f7ea9fb44
> 
> The classifications queue is gone (taking any messages with it), the meta
> queue is only replicated to one other node. The tracking queue is OK, but
> only because it disappeared and was recreated empty.

Did the previously downed node really come back up and join the cluster
correctly? Does the output of rabbitmqctl cluster_status on each of the
3 nodes report all 3 nodes are running?
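
If it helps, the check can be scripted. The Python sketch below assumes that
cluster_status prints an Erlang term containing {running_nodes,[...]}, and that
you capture the output from each node yourself. The node names and the sample
text are hypothetical, not real output.

```python
import re


def running_nodes(status_output):
    """Return the node names listed in the {running_nodes,[...]} term, if any."""
    match = re.search(r"\{running_nodes,\s*\[([^\]]*)\]\}", status_output)
    if match is None:
        return set()
    # Atoms may be single-quoted in Erlang output; strip quotes and whitespace.
    names = (n.strip().strip("'") for n in match.group(1).split(","))
    return {n for n in names if n}


def missing_nodes(expected, status_output):
    """Return the expected nodes that this output does not report as running."""
    return set(expected) - running_nodes(status_output)


# Hypothetical sample, shaped like the assumed output format.
SAMPLE = """Cluster status of node rabbit@node1 ...
[{nodes,[{disc,[rabbit@node1,rabbit@node2,rabbit@node3]}]},
 {running_nodes,[rabbit@node1,rabbit@node3]}]
...done."""

EXPECTED = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]
print(sorted(missing_nodes(EXPECTED, SAMPLE)))
```

Run it against the output from each of the 3 nodes in turn; a node that is
missing from any one of them has not rejoined as far as that node is concerned.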

> I've also found that sometimes queues stop delivering messages when certain
> nodes go down (even after being left for minutes), despite being in HA mode
> (haven't been able to dig into this more yet).

When this happens, could you check the logs please on all nodes for any
entries regarding the queues. If a node with a queue master goes down
then there should be entries about some slave on another node being
promoted, but even if it's just a slave that dies, there should be
entries in the logs that show others have noticed that.

> Sometimes connections to nodes which have gone down are still shown
> and get stuck. Using netstat reveals that those connections do not
> exist at TCP level, and using the Web UI to 'Force Close' them
> generates an error (red box saying unable to connect to server -
> however the rest of the UI works fine).

Interesting. That might be a bug in the mgmt plugin. Does rabbitmqctl
list_connections also show such phantom connections?
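
One way to compare the two views is sketched below in Python. It assumes you
run rabbitmqctl list_connections peer_address peer_port and that it prints one
tab-separated address/port pair per connection; the set of live TCP peers is
something you would collect from netstat yourself. The addresses shown are
hypothetical.

```python
def listed_peers(list_connections_output):
    """Parse tab-separated 'peer_address<TAB>peer_port' lines into a set of pairs."""
    peers = set()
    for line in list_connections_output.splitlines():
        fields = line.split("\t")
        # Skip banner lines and anything else that is not an address/port pair.
        if len(fields) == 2 and fields[1].isdigit():
            peers.add((fields[0], int(fields[1])))
    return peers


def phantom_connections(list_connections_output, live_tcp_peers):
    """Return connections Rabbit still lists that have no live TCP peer."""
    return listed_peers(list_connections_output) - set(live_tcp_peers)


# Hypothetical sample, shaped like the assumed output format.
SAMPLE = "Listing connections ...\n10.0.0.5\t50001\n10.0.0.6\t50002\n...done.\n"
LIVE = {("10.0.0.6", 50002)}  # peers seen at TCP level, e.g. taken from netstat

print(sorted(phantom_connections(SAMPLE, LIVE)))
```

Anything it reports is a connection that Rabbit itself still believes in, which
would point away from the mgmt plugin.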

> This seems like rather odd behaviour, and means we can't put it into
> production. I'm having trouble replicating it, all I know is that after
> cycling nodes a few times it stops working as we'd expect.

My first guess is that the node hasn't actually rejoined properly, and
may require a manual removal of the rabbitmq database directory before
starting the node and explicitly reclustering it. Please note that Rabbit
has never claimed to cope with netsplits or partitions, and loss of nodes
falls under this category. Thus a failed Rabbit that is simply restarted
may very well not rejoin the cluster, and may require manual intervention.
Apologies if I'm telling you things you already know, I just want to be
clear.

However, if rabbitmqctl cluster_status shows that the node has rejoined,
then that may well indicate some other bug. Log entries, the output of
cluster_status, and the output of rabbitmqctl report will be useful (if
large, please feel free to send off-list to info@).

> Today, while bringing up the cluster from scratch (shutdown all instances,
> wipe mnesia, restart) I've got 3 nodes running, but an HA queue with 1
> master, 2 synced slaves and 1 unsynced slave. Other queues are showing 1
> master and 2 synced slaves as expected. (see
> http://www.evernote.com/shard/s53/sh/b6345885-88d1-4d21-9614-24abda75a1cb/c2a0dd265b39d21f3e8c336c67ced979
> )

Well, drain the "unsynced" queue and it'll become synced. Yes, this is
very much an undesired limitation, and in some cases it might be
problematic enough to make Rabbit's HA as it stands not-fit-for-purpose.
It should be fixed in the future, but it seemed the wrong decision to
delay active-active HA for many further months to ensure it launched
with eager resync.

Best wishes,

Matthew