[rabbitmq-discuss] Cluster hung on node death
particlesquirrel at gmail.com
Fri Jun 27 20:55:33 BST 2014
Today the physical machine hosting one dedicated node had a kernel panic
(Linux, CentOS 6). When that happened, the other two nodes in the cluster
seemed to choke and did not respond at all.
"rabbitmqctl cluster_status" on either of the other nodes would hang.
The web management UI didn't respond. I could get a login page to come up
but after that it would go back to not responding.
When the crashed machine came back up, without RabbitMQ being started on it,
the other two nodes seemed to free up and start operating normally again as
soon as its networking was responding.
After the rest of the cluster was operating normally again, we brought down
the crashed machine to do a memtest, and we didn't experience the cluster
freeze again (RabbitMQ was never started back up on the failed node).
This cluster (we went live with multiple clusters yesterday) runs on 3
dedicated physical machines. All of them are on CentOS 6 with RabbitMQ
v3.3.2. All nodes are disc nodes. All queues are durable and mirrored.
This cluster has one queue, plus thousands of dynamic shovels (which of
course include their own queues on this cluster) connecting to queues on 3
other clusters with similar setups. Each node has about 7 GB of disk free
on the relevant partition, and 48 GB of RAM with vm_memory_high_watermark
set to 0.9, but even at diminished capacity right now, the most RAM used is
1.2 GB on one node and 600 MB on the other (these boxes were way overbuilt
with short-term growth in mind).
Sadly, there was nothing in the logs. We realized this might be related to
the logging bug fixed in 3.3.3, so we just upgraded our dev environment to
start the process to deal with that.
Any thoughts on what the cause of this freeze up could have been? And how
to mitigate it? Or any troubleshooting / information gathering we could do
if it happens again? It's a scary thing to have happen on a Friday
afternoon. We were counting on three node clusters getting us through if
there was an outage of a node during the weekend... but now we're all
afraid to go home for the weekend!
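
On the information-gathering side, one option is a small watchdog that
runs "rabbitmqctl cluster_status" under a hard timeout and records whether
it returned, failed, or hung, so there is at least a timestamped trace even
when the broker logs stay empty. What follows is only a minimal sketch,
assuming Python 3.7+ on the nodes and rabbitmqctl on the PATH. The function
names and the 30-second timeout are illustrative choices, not anything
provided by RabbitMQ.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, elapsed_seconds, output).

    status is "ok" (exit code 0), "failed" (non-zero exit code),
    or "hung" (no exit within timeout_s; the child is killed).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            text=True,
        )
        status = "ok" if proc.returncode == 0 else "failed"
        output = proc.stdout
    except subprocess.TimeoutExpired as exc:
        status = "hung"
        output = exc.output or ""
        # Partial output captured before a timeout may arrive as bytes.
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
    return status, time.monotonic() - start, output


def check_cluster_status(timeout_s=30):
    """Hypothetical helper: probe the local node and print one log line."""
    status, elapsed, output = run_with_timeout(
        ["rabbitmqctl", "cluster_status"], timeout_s
    )
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} cluster_status {status} after {elapsed:.1f}s")
    return status, output
```

Calling check_cluster_status() from cron every minute or so on each node
would show when the hang started and ended relative to the node failure.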