[rabbitmq-discuss] Outage with 3-node RabbitMQ 3.1.3 Cluster

Thu Nov 7 11:47:52 GMT 2013

Hi Matt,

On 6 Nov 2013, at 16:02, Matt Wise wrote:
> "Note that the cluster configuration is applied only to fresh nodes. A fresh node is a node which has just been reset or is being started for the first time. Thus, the automatic clustering won't take place after restarts of nodes. This means that any change to the clustering via rabbitmqctl will take precedence over the automatic clustering configuration."
> 
> 
> So far we've taken the approach that clustering configuration should be hard-coded into the rabbitmq.config files. This works well in explicitly defining all of the hosts in a cluster on every machine, but it also means that adding a 4th node to a 3-node cluster will cause the 3 running live nodes to do a full service restart, which is bad.

That's not strictly necessary - you can add nodes to a cluster without 

> Our rabbitmq.config though is identical on all of the machines (other than the server-list, which may have been in-flux when Puppet was restarting these services)
> 
> [
>         {rabbit, [
>                 {log_levels, [{connection, warning}]},
>                 {cluster_partition_handling,pause_minority},
>                 {tcp_listeners, [ 5672 ] },
>                 {ssl_listeners, [ 5673 ] },
>                 {ssl_options, [{cacertfile,"/etc/rabbitmq/ssl/cacert.pem"},
>                         {certfile,"/etc/rabbitmq/ssl/cert.pem"},
>                         {keyfile,"/etc/rabbitmq/ssl/key.pem"},
>                         {verify,verify_peer},
>                         {fail_if_no_peer_cert,true}
>                 ]},
>                 {cluster_nodes,['rabbit@i-23cf477b', 'rabbit@i-07d8bc5f', 'rabbit@i-a3291cf8']}
>         ]}
> ].
>  
> > Questions:
> >   1. We only had ~2500 messages in the queues (they are HA'd and durable). The policy is { 'ha-mode': 'all' }. When serverA and serverB restarted, why did they never come up? Unfortunately in the restart process, they blew away their log files as well which makes this really tough to troubleshoot.
> 
> It's nigh on impossible to guess what might've gone wrong without any log files to verify against. We could sit and stare at all the relevant code for weeks and not spot a bug that's been triggered here, since if it were obvious we would've fixed it already.
> 
> If you can give us a very precise set of steps (and timings) that led to this situation, I can try and replicate what you've seen, but I don't fancy my chances to be honest.
> 
> It's a tough one for us to reproduce.. but I think the closest steps would be:
> 
>   1. Create a 3-node cluster... configured with similar config to the one I pasted above.
>   2. Create enough publishers and subscribers that you have a few hundred messages/sec going through the three machines.
>   3. On MachineA and MachineC, remove MachineB from the config file.
>   4. Restart MachineA's rabbitmq daemon using init script
>   5. Wait 3 minutes... theoretically #4 is still in process.. now issue the same restart to MachineC.
> 
>   Fail.
> 

We will take a look at that.

> That's our best guess right now.. but agreed, the logs are a problem. Can we configure RabbitMQ to log through syslog for the future?
> 

Yes, by replacing the standard OTP logging mechanism with lager, an open source logging framework from Basho Technologies. I have developed a simple plugin (see https://github.com/hyperthunk/rabbitmq-lager) that does this for you. See the README in that repository for further details, and the README at https://github.com/basho/lager for information on routing to syslog. You can get a binary of the plugin, compiled against R14B03, from https://raw.github.com/hyperthunk/rabbitmq-lager/binary-dist/lager-2.0.0.ez, though you'll need to compile from source if you're running a newer Erlang than that.
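
As a rough illustration only, a lager section routing to syslog might look something like the fragment below, sitting alongside the {rabbit, [...]} tuple in rabbitmq.config. This is a sketch under assumptions: it presumes the separate lager_syslog backend is available on the code path (the plugin may or may not bundle it), and the ident "rabbitmq", the facility local0, and the info level are placeholders rather than recommended values. Check both READMEs before relying on it.

```erlang
%% Hypothetical rabbitmq.config fragment (assumption, not taken from the
%% plugin's documentation): sends lager output to the local syslog daemon.
[
        {lager, [
                {handlers, [
                        %% {Backend, [Ident, Facility, Level]}
                        %% "rabbitmq", local0 and info are placeholder values.
                        {lager_syslog_backend, ["rabbitmq", local0, info]}
                ]}
        ]}
].
```

With something along these lines, log lines would go to whatever syslog sends the chosen facility to, so they should survive a restart that wipes the broker's own log files.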

> 
> >
> >   2. I know that restarting serverA and serverB at nearly the same time is obviously a bad idea -- we'll be implementing some changes so this doesn't happen again -- but could this have led to data corruption?
> 
> It's possible, though obviously that shouldn't really happen. How close were the restarts to one another? How many HA queues were mirrored across these nodes, and were they all very busy (as your previous comment about load seems to suggest)? We could try replicating that scenario in our tests, though it's not easy to get the timing right and obviously the existence of network infrastructure on which the nodes are running won't be the same (and that can make a surprisingly big difference IME).
> 
> The restarts were within a few minutes of each other. There are 5 queues, and all 5 queues are set to mirror to 'all' nodes in the cluster. They were busy, but no more than maybe 100 messages/sec coming in/out. 
>  

I'll take that into account when trying to reproduce - thanks.

> 
> > Once the entire RabbitMQ farm was shut down, we actually were forced to move the rabbitmq data directory out of the way and start up the farm completely with blank databases. It seemed that RabbitMQ 3.1.3 really did not want to recover from this failure. Any thoughts?
> >
> >   3. Lastly .. in the event of future failures, what tools are there for recovering our Mnesia databases? Is there any way we can dump out the data into some raw form, and then import it back into a new fresh cluster?
> >
> 
> I'm afraid there are not, at least not "off the shelf" ones anyway. If you are desperate to recover important production data however, I'm sure we could explore the possibility of trying to help with that somehow. Let me know and I'll make some enquiries at this end.
> 
> At this point we can move on from the data loss... but it does make for an interesting issue. Having tools to analyze the Mnesia DB and get "most of" the messages out in some format where they could be re-injected into a fresh cluster would be an incredibly useful tool. I wonder how hard it is to do?

The messages are not stored in mnesia - we have a "proprietary" on-disk message store. There is a tool that can be used to interact with an offline message store, but it's bit-rotted now and was never fully supported anyway. If a customer does encounter message loss in production, we can offer commercial support to try and resolve the issue, though obviously we're trying very hard to ensure this never happens.

Cheers,
Tim
