[rabbitmq-discuss] HA - missing or incompletely replicated queues

Mon Nov 7 13:27:11 GMT 2011

Hi all,

We were hoping to go into production next week with Rabbit's HA queues,
however testing by randomly killing processes has shown some odd behaviour.
I'm not sure whether it's something odd with our setup, or a set of
interacting bugs.

When started, we create three queues:

   - storm-classifications
   - storm-meta
   - storm-tracking

There are 3 nodes in the cluster, with queues replicated to all nodes. In
testing, I've been issuing kill commands to take out beam, rabbit and
erlang processes, which close the channels and make clients reconnect to a
different nodes. I allow the downed node to recover and come back up before
killing another one (also allowing queues to synchronize).

After doing this a couple of times, we see the following:

http://www.evernote.com/shard/s53/sh/448a967e-b995-4f54-986d-50194955550f/416d4e71af91f6e2f6d5311f7ea9fb44

The classifications queue is gone (taking any messages with it), the meta
queue is only replicated to one other node. The tracking queue is OK, but
only because it disappeared and was recreated empty.

The meta queue - despite not having any messages flowing through it in the
test system - sometimes shows that one of the nodes is not synchronised for
minutes after rejoining.

I've also found that sometimes queues stop delivering messages when certain
nodes go down (even after being left for minutes), despite being in HA mode
(haven't been able to dig into this more yet). Sometimes connections to
nodes which have gone down are still shown and get stuck. Using netstat
reveals that those connections do not exist at TCP level, and using the Web
UI to 'Force Close' them generates an error (red box saying unable to
connect to server - however the rest of the UI works fine).

 This seems like rather odd behaviour, and means we can't put it into
production. I'm having trouble replicating it, all I know is that after
cycling nodes a few times it stops working as we'd expect.

Today, while bringing up the cluster from scratch (shutdown all instances,
wipe mnesia, restart) I've got 3 nodes running, but an HA queue with 1
master, 2 synced slaves and 1 unsynced slave. Other queues are showing 1
master and 2 synced slaves as expected. (see
http://www.evernote.com/shard/s53/sh/b6345885-88d1-4d21-9614-24abda75a1cb/c2a0dd265b39d21f3e8c336c67ced979
)

Currently on 2.6.0, Ubuntu 11.04 on EC2.

Quite infuriating, and no idea how to fix it.

A

-- 
Dr Ashley Brown
Chief Architect

e: ashley at spider.io
a: spider.io, 353 The Strand, WC2R 0HS
w: http://spider.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20111107/e14fd1de/attachment.htm>