[rabbitmq-discuss] Network partition queue issues

Fri May 9 12:26:11 BST 2014

Hi,

Yesterday we had two unfortunate network partitions on one of our two-node
clusters, about an hour apart. After the second partition, when we restarted
one of the nodes, we encountered some issues:

1) one of the queues disappeared, including any queued messages.
2) two other queues appear hung and don't respond to anything.

We've managed to resolve 1) by recreating the queue and bindings and
republishing the messages, but 2) is still a problem, as we can't do anything
with these queues, not even delete them (the management interface and API just
hang when trying to delete). Any consumers also appear to hang when
interacting with these queues. Restarting the entire cluster also didn't help.

Is there any way to prevent 1), and can we somehow solve 2) without resetting
the entire cluster? We're currently running RabbitMQ 3.2.4. I have logs
available, but I'd rather not post these publicly since there's some sensitive
data in there.

FTR we're using autoheal and all queues on the cluster have ha-mode=all and
ha-sync-mode=automatic. Also, according to heartbeat, the nodes only appear to
have lost contact for a few seconds (nothing seems to be logged during the
second split).
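
For reference, that setup corresponds roughly to the sketch below. The policy
name "ha-all" and the catch-all pattern ".*" are placeholders for illustration,
not necessarily our exact values.

```erlang
%% rabbitmq.config (sketch; placeholder values are marked below)
[
  {rabbit, [
    %% autoheal: pick a winning partition and restart the nodes outside it
    {cluster_partition_handling, autoheal}
  ]}
].

%% The mirroring policy is set at runtime rather than in this file, e.g.:
%%   rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
%% ("ha-all" and ".*" are a placeholder policy name and queue pattern)
```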

Thanks,
Jon