<div dir="ltr">Hi Simon,<div><br></div><div>Upgrading to 3.1.5 seems to have made things better. So either one of the other bug fixes in 3.1.2 - 3.1.5 helped, or I was just unlucky the couple of times we tried it with 3.1.1. ;-)</div>
<div><br></div><div>Unfortunately I don't have the logs from those previous failed attempts.</div><div><br></div><div>Thanks for the response,</div><div>Chris</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On Fri, Aug 30, 2013 at 12:53 PM, Simon MacMullen <span dir="ltr">&lt;<a href="mailto:simon@rabbitmq.com" target="_blank">simon@rabbitmq.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
There was definitely a bug in autoheal fixed in 3.1.1, but I'm not aware of anything since then. However, it's possible that some other bug we have since fixed is causing your problems with autoheal.<br>
<br>
So:<br>
<br>
1) You might as well try 3.1.5.<br>
2) Are there any crashes in the logs on the minority node?<br>
<br>
Cheers, Simon<div><div class="h5"><br>
<br>
On 30/08/2013 4:26PM, Chris wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
Hi All,<br>
<br>
As part of our testing of failovers, we yank the network cable on a<br>
machine (to simulate a switch going down). When we plug it back in,<br>
RabbitMQ goes into the network partition mode. At first we were using<br>
the default ('ignore') option for dealing with partitions, but it caused<br>
problems.<br>
<br>
After that we put the nodes into 'autoheal' mode. This did not improve<br>
things. Not only did the minority node not rejoin the partition, but it<br>
refused to restart without manually killing the process. It also caused<br>
problems on the other nodes (in the majority). They stopped accepting<br>
connections and I couldn't even log into the web UI. So clearly,<br>
'autoheal' didn't seem to work as intended.<br>
<br>
We're using RabbitMQ 3.1.1. Is there anything fixed since then that<br>
might help with our situation? Our end goal is to have everything<br>
working again without intervention. I understand that this could cause<br>
*some* data loss during the autoheal process, but this is probably OK.<br>
We'd love just to get all three nodes happy again without having to<br>
manually restart any nodes.<br>
<br>
Thanks,<br>
Chris<br>
<br>
<br></div></div>
_______________________________________________<br>
rabbitmq-discuss mailing list<br>
<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com" target="_blank">rabbitmq-discuss@lists.rabbitmq.com</a><br>
<a href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><br>
<br><span class="HOEnZb"><font color="#888888">
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
-- <br>
Simon MacMullen<br>
RabbitMQ, Pivotal<br>
</font></span></blockquote></div><br></div>