[rabbitmq-discuss] Cluster hung on node death

Fri Jun 27 20:55:33 BST 2014

Today the physical machine hosting one of our dedicated nodes had a kernel
panic (Linux, CentOS 6). When that happened, the other two nodes in the
cluster seemed to choke and stopped responding entirely.

"rabbitmqctl cluster_status" on either of the other nodes would hang.

The web management UI didn't respond.  I could get a login page to come up 
but after that it would go back to not responding.
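
One way to catch this state without tying up a shell is to wrap the status
command in a timeout. What follows is only a minimal sketch, assuming
Python 3 is available on the broker hosts. The `probe` helper and the
10-second timeout are illustrative choices, not anything RabbitMQ ships.
The only real command it relies on is `rabbitmqctl cluster_status`.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the result.

    Returns a (status, output) tuple where status is one of:
      "ok"      - command exited with code 0
      "failed"  - command exited with a non-zero code
      "hung"    - command did not finish within timeout_s seconds
      "missing" - the executable could not be found
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            timeout=timeout_s,         # child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return ("hung", "")
    except FileNotFoundError:
        return ("missing", "")
    status = "ok" if result.returncode == 0 else "failed"
    return (status, result.stdout.decode(errors="replace"))


if __name__ == "__main__":
    # Assumes rabbitmqctl is on PATH and we have permission to run it.
    status, output = probe(["rabbitmqctl", "cluster_status"], timeout_s=10.0)
    print(status)
    print(output)
```

Run from cron or a monitoring agent on each node, a "hung" result would at
least timestamp when the surviving nodes stopped answering.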

When the crashed machine came back up and its networking was responding
again (RabbitMQ was not started on it), the other two nodes seemed to free
up and start operating normally again.

After the rest of the cluster was operating normally again, we brought down 
the crashed machine to do a memtest, and we didn't experience the cluster 
freeze again (rabbitmq was not ever started back up on the failed node).

This cluster (we went live with multiple clusters yesterday) runs on 3
dedicated physical machines, all on CentOS 6 with RabbitMQ 3.3.2. All
nodes are disc nodes. All queues are durable and mirrored. This cluster
has one queue, plus thousands of dynamic shovels (which of course include
their own queues on this cluster) connecting to queues on 3 other clusters
with similar setups. Each node has about 7 GB of disk free on the relevant
partition and 48 GB of RAM, with the memory high watermark
(vm_memory_high_watermark) set to 0.9. Even at diminished capacity right
now, the most RAM used is 1.2 GB on one node and 600 MB on the other
(these boxes were way overbuilt with short-term growth in mind).

Sadly, there was nothing in the logs.  We realized this might be related to 
the logging bug fixed in 3.3.3, so we just upgraded our dev environment to 
start the process to deal with that.

Any thoughts on what the cause of this freeze-up could have been? And how
to mitigate it? Or any troubleshooting / information gathering we could do
if it happens again? It's a scary thing to have happen on a Friday
afternoon. We were counting on three-node clusters getting us through if
a node went down over the weekend... but now we're all afraid to go home
for the weekend!

Thanks!
Dan.
