[rabbitmq-discuss] Outage with 3-node RabbitMQ 3.1.3 Cluster

Tim Watson tim at rabbitmq.com
Wed Nov 6 10:37:41 GMT 2013

Hi Matt,

Sorry to hear you've been running into problems.

On 5 Nov 2013, at 22:05, Matt Wise wrote:

> (sorry if this gets posted twice.. first email never seemed to make it to the list)
> Hey... I had a pretty rough time today with a 3-node RabbitMQ 3.1.3 cluster that's under pretty heavy use (6-7 million messages per day -- 100MB peak bandwidth per node). I want to pose a few questions here. First off, here's the basic configuration though.
> Configuration:
>   serverA, serverB and serverC are all configured with RabbitMQ 3.1.3. They each are configured via Puppet ... and Puppet uses a dynamic node discovery plugin (zookeeper) to find the nodes. The node lists are hard-coded into the rabbitmq.config file. A dynamic server list generator supplies Puppet with this list of servers (and is not really necessary to describe here in this email).
> Scenario:
>   A momentary configuration blip caused serverA and serverB to begin reconfiguring their rabbitmq.config files... when they did this, they also both issued a 'service rabbitmq restart' command. This command took 40+ minutes and ultimately failed. During this failure, RabbitMQ was technically running and accepting connections on the TCP ports ... but it would not actually answer any queries. Commands like list_queues would hang indefinitely.

What HA recovery policy (if any) do you have set up? A and B might end up with a different "view of the world" in their respective rabbitmq.config files (relative to each other and/or to C) and then get restarted, but this shouldn't affect their view of the cluster, because as per http://www.rabbitmq.com/clustering.html:

"Note that the cluster configuration is applied only to fresh nodes. A fresh node is a node which has just been reset or is being started for the first time. Thus, the automatic clustering won't take place after restarts of nodes. This means that any change to the clustering via rabbitmqctl will take precedence over the automatic clustering configuration."
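Because the cluster membership recorded locally takes precedence over rabbitmq.config after a node's first start, a node whose config file was rewritten mid-flight can only be re-pointed by resetting it explicitly. A rough sketch of that procedure, assuming 3.x-era rabbitmqctl and your node names (serverA/serverC are taken from your description):

```shell
# On the node whose config changed (e.g. serverA). NOTE: reset discards that
# node's local Mnesia state, so only do this once its mirrored queues are
# safely hosted on a surviving node.
rabbitmqctl stop_app
rabbitmqctl reset                          # forget old cluster membership
rabbitmqctl join_cluster rabbit@serverC    # rejoin via a healthy node
rabbitmqctl start_app
```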

> Questions:
>   1. We only had ~2500 messages in the queues (they are HA'd and durable). The policy is { 'ha-mode': 'all' }. When serverA and serverB restarted, why did they never come up? Unfortunately in the restart process, they blew away their log files as well which makes this really tough to troubleshoot.

It's nigh on impossible to guess what might've gone wrong without any log files to verify against. We could sit and stare at all the relevant code for weeks and not spot the bug that was triggered here; if it were obvious, we would've fixed it already.

If you can give us a very precise set of steps (and timings) that led to this situation, I can try and replicate what you've seen, but I don't fancy my chances to be honest.
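For reference, the { 'ha-mode': 'all' } policy you describe would typically be declared as below. The policy name "ha-all" and the match-everything pattern are illustrative; adjust to however your Puppet manifests actually apply it:

```shell
# Mirror every queue across all nodes in the cluster (RabbitMQ 3.x policy syntax).
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
```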

>   2. I know that restarting serverA and serverB at nearly the same time is obviously a bad idea -- we'll be implementing some changes so this doesn't happen again -- but could this have led to data corruption?

It's possible, though obviously that shouldn't really happen. How close were the restarts to one another? How many HA queues were mirrored across these nodes, and were they all very busy (as your earlier comment about load seems to suggest)? We could try replicating that scenario in our tests, though it's not easy to get the timing right, and obviously the network infrastructure on which your nodes run won't match ours (which can make a surprisingly big difference, in my experience).
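When restarting mirrored nodes one at a time, it's worth confirming that each queue's mirrors have actually synchronised before taking the next node down. A sketch, assuming the 3.1-era list_queues info items:

```shell
# Show each queue's master pid, its mirror pids, and which mirrors hold a
# full copy of the queue contents; a mirror absent from
# synchronised_slave_pids would lose messages if promoted on failover.
rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids
```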

> Once the entire RabbitMQ farm was shut down, we actually were forced to move the rabbitmq data directory out of the way and start up the farm completely with blank databases. It seemed that RabbitMQ 3.1.3 really did not want to recover from this failure. Any thoughts?
>   3. Lastly .. in the event of future failures, what tools are there for recovering our Mnesia databases? Is there any way we can dump out the data into some raw form, and then import it back into a new fresh cluster?

I'm afraid there are none, at least not "off the shelf" ones anyway. If you're desperate to recover important production data, however, I'm sure we could explore the possibility of trying to help with that somehow. Let me know and I'll make some enquiries at this end.
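One thing that can be captured ahead of time -- though it covers broker metadata only, not message payloads, so it wouldn't have recovered your queue contents -- is an export of the definitions via the management plugin's HTTP API. Assuming the plugin is enabled and default credentials (both assumptions on my part):

```shell
# Export exchanges, queues, bindings, vhosts, users and policies as JSON.
# Messages themselves are NOT included in this export.
curl -u guest:guest http://localhost:15672/api/definitions > definitions.json
```

Importing that JSON into a fresh cluster recreates the topology, which at least makes a rebuild from blank databases less painful.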
