[rabbitmq-discuss] Outage with 3-node RabbitMQ 3.1.3 Cluster

Matt Wise matt at nextdoor.com
Tue Nov 5 22:05:28 GMT 2013


(sorry if this gets posted twice.. first email never seemed to make it to
the list)

Hey... I had a pretty rough time today with a 3-node RabbitMQ 3.1.3 cluster
that's under pretty heavy use (6-7 million messages per day -- 100MB peak
bandwidth per node). I want to pose a few questions here, but first, here's
the basic configuration.

Configuration:
  serverA, serverB and serverC are all running RabbitMQ 3.1.3. Each is
configured via Puppet, and Puppet uses a dynamic node-discovery plugin
(ZooKeeper) to find the nodes. The node lists are hard-coded into the
rabbitmq.config file; a dynamic server-list generator supplies Puppet with
the list of servers (the generator itself isn't really relevant here).

Scenario:
  A momentary configuration blip caused serverA and serverB to begin
reconfiguring their rabbitmq.config files. When they did this, they also
both issued a 'service rabbitmq restart' command. This command took
40+ minutes and ultimately failed. During this failure, RabbitMQ was
technically running and accepting connections on the TCP ports, but it
would not actually answer any queries. Commands like list_queues would hang
indefinitely.

Questions:
  1. We only had ~2500 messages in the queues (they are HA'd and durable).
The policy is { 'ha-mode': 'all' }. When serverA and serverB restarted, why
did they never come back up? Unfortunately, the restart process also blew
away their log files, which makes this really tough to troubleshoot.
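For reference, an 'ha-mode: all' policy like the one described is typically
declared via rabbitmqctl along these lines (3.x syntax; the policy name
"ha-all" and the ".*" pattern are assumptions, not our exact command):

```shell
# Mirror every queue on every node in the cluster (default vhost)
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
```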

  2. I know that restarting serverA and serverB at nearly the same time is
obviously a bad idea -- we'll be implementing changes so this doesn't
happen again -- but could it have led to data corruption? Once the entire
RabbitMQ farm was shut down, we were actually forced to move the rabbitmq
data directory out of the way and start the farm up with completely blank
databases. It seemed that RabbitMQ 3.1.3 really did not want to recover
from this failure. Any thoughts?
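Since the simultaneous restarts were the trigger, one mitigation is to
serialize the restart step behind a mutual-exclusion lock. Here's a minimal
local sketch of that shape; in a real cluster you'd want a distributed lock
(e.g. in the ZooKeeper that's already doing node discovery), and the
function and lock-file names here are hypothetical:

```python
import fcntl
import os


def restart_rabbitmq_exclusively(do_restart,
                                 lock_path="/var/run/rabbitmq-restart.lock"):
    """Run do_restart() only if no other restart currently holds the lock.

    do_restart would be something like
    lambda: subprocess.check_call(["service", "rabbitmq-server", "restart"]).
    Returns True if the restart ran, False if another restart was in flight.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail fast instead of queueing up
            # a second restart behind the first one.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another restart is mid-flight; retry later
        do_restart()
        return True
    finally:
        os.close(fd)  # closing the fd also releases the flock
```

With a ZooKeeper-backed lock instead of flock, serverA and serverB could
never enter the restart path at the same time even though they're separate
machines.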

  3. Lastly, in the event of future failures, what tools are there for
recovering our Mnesia databases? Is there any way to dump the data out in
some raw form and then import it back into a fresh cluster?
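One partial answer I'm aware of: with the rabbitmq_management plugin
enabled, GET /api/definitions exports the topology (queues, exchanges,
bindings, users, policies) as JSON, which can be POSTed back into a fresh
cluster -- though it does not include message bodies. A hypothetical helper
for sanity-checking such an export before re-importing it (the function
name and field handling are my sketch, but the top-level keys match the
definitions format):

```python
import json


def summarize_definitions(defs_json):
    """Summarize a /api/definitions export so you can eyeball what a
    re-import into a fresh cluster would recreate.

    Accepts either the raw JSON string or an already-parsed dict.
    Note: definitions cover topology and policies only -- the ~2500
    queued messages would NOT be in this export.
    """
    defs = json.loads(defs_json) if isinstance(defs_json, str) else defs_json
    return {
        "queues": [q["name"] for q in defs.get("queues", [])],
        "exchanges": [e["name"] for e in defs.get("exchanges", [])],
        "bindings": len(defs.get("bindings", [])),
        "policies": [p["name"] for p in defs.get("policies", [])],
    }
```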

Matt