[rabbitmq-discuss] AutoHeal not working after yanking network cable

Simon MacMullen simon at rabbitmq.com
Fri Aug 30 17:53:56 BST 2013


There was definitely a bug in autoheal fixed in 3.1.1, but I'm not aware
of anything since then. However, it's possible that some other bug we
have since fixed is causing your problems with autoheal.

So:

1) You might as well try 3.1.5.
2) Are there any crashes in the logs on the minority node?

Cheers, Simon

On 30/08/2013 4:26PM, Chris wrote:
> Hi All,
>
> As part of our testing of failovers, we yank the network cable on a
> machine (to simulate a switch going down).  When we plug it back in,
> RabbitMQ goes into the network partition mode.  At first we were using
> the default ('ignore') option for dealing with partitions, but it caused
> problems.
>
> After that we put the nodes into 'autoheal' mode.  This did not improve
> things.  Not only did the minority node not rejoin the cluster, but it
> refused to restart without manually killing the process.  It also caused
> problems on the other nodes (in the majority).  They stopped accepting
> connections and I couldn't even log into the web UI.  So clearly,
> 'autoheal' didn't seem to work as intended.
>
> We're using RabbitMQ 3.1.1.  Is there anything fixed since then that
> might help with our situation?  Our end goal is to have everything
> working again without intervention.  I understand that this could cause
> *some* data loss during the autoheal process, but this is probably OK.
> We'd love just to get all three nodes happy again without having to
> manually restart any nodes.
>
> Thanks,
> Chris

-- 
Simon MacMullen
RabbitMQ, Pivotal

