[rabbitmq-discuss] AutoHeal not working after yanking network cable

Chris stuff at moesel.net
Fri Aug 30 16:26:41 BST 2013


Hi All,

As part of our failover testing, we yank the network cable on a machine
(to simulate a switch going down). When we plug it back in, RabbitMQ
reports a network partition. At first we were using the default
('ignore') option for dealing with partitions, but it caused problems.

After that we put the nodes into 'autoheal' mode. This did not improve
things. Not only did the minority node fail to rejoin the cluster, but it
also refused to restart until we manually killed the process. It also
caused problems on the other nodes (in the majority): they stopped
accepting connections, and I couldn't even log into the web UI. So
'autoheal' clearly did not work as intended.

We're using RabbitMQ 3.1.1. Has anything been fixed since then that might
help with our situation? Our end goal is to have everything working again
without intervention. I understand that this could cause *some* data loss
during the autoheal process, but that is probably OK. We'd just love to
get all three nodes healthy again without having to manually restart any
of them.

Thanks,
Chris

