<div dir="ltr">Because we're sometimes just mean to our software, I wrote a torture test to see how RabbitMQ's Autoheal deals with repeated partitions.<div><br></div><div>In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my test, which uses "iptables" to knock out the link between the two brokers and then restore it.</div>
<div><br></div><div>It does this break/fix continuously in a loop. The time between partitions and the time spent inside each partition are both configurable.</div><div><br></div><div>Using 60 seconds between inducing partitions and 60 seconds in a partitioned state, I expect that this might be messy: the brokers try to autoheal, and then everything falls apart. However, I'd expect that once I stop my torture and return things to "normal", an autoheal will eventually succeed and the brokers will be happily clustered again.</div>
<div><br></div><div>This isn't what happens. Instead, the two brokers essentially ignore each other, even after I wait 10+ minutes. I can see each broker, but each thinks the other is missing.</div><div><br></div>
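<div>For reference, the break/fix loop described above can be sketched roughly as follows in Python. This is not the actual test script: the function names, the injectable "run" and "sleep" parameters, and the choice of a DROP rule on the INPUT chain are all assumptions made for illustration. It would need to run as root on one of the brokers.</div>

```python
import subprocess
import time


def iptables_rule(action, peer_ip):
    """Build an iptables command that appends ("-A") or deletes ("-D")
    a rule dropping all incoming traffic from peer_ip.

    Assumption: dropping inbound packets on one broker is enough to
    partition the pair; the real test may use different rules.
    """
    return ["iptables", action, "INPUT", "-s", peer_ip, "-j", "DROP"]


def torture(peer_ip, healthy_secs=60, partitioned_secs=60, cycles=3,
            run=subprocess.check_call, sleep=time.sleep):
    """Repeatedly break and restore the link to peer_ip.

    run and sleep are injectable (hypothetical parameters) so the loop
    can be dry-run without root privileges or real delays.
    """
    for _ in range(cycles):
        # Let the cluster sit in a connected state first.
        sleep(healthy_secs)
        # Induce the partition.
        run(iptables_rule("-A", peer_ip))
        try:
            # Stay partitioned for the configured time.
            sleep(partitioned_secs)
        finally:
            # Always remove the rule, even if interrupted mid-partition.
            run(iptables_rule("-D", peer_ip))
```

<div>For example, calling torture("192.0.2.11") on mq1, where 192.0.2.11 is a placeholder for mq2's address, would run three cycles with the 60-second timings mentioned above. The try/finally is there so that an interrupted run does not leave the DROP rule behind.</div><div><br></div>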
<div>Here's a filtered view of the logs, grepping for "Autoheal|Starting|Stopping|Partitions|Winner|Loser":</div><div><br></div><div>
<p class="">rabbit@mq2.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq2.log: Autoheal: I am the winner, waiting for [rabbit@mq1] to stop</p><p class="">rabbit@mq2.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq1] to stop</p>
<p class="">rabbit@mq1.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal decision</p><p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p>
<p class="">rabbit@mq1.log: * Winner: rabbit@mq2</p><p class="">rabbit@mq1.log: * Losers: [rabbit@mq1]</p><p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq2</p><p class="">rabbit@mq1.log: Autoheal decision</p>
<p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p><p class="">rabbit@mq1.log: * Winner: rabbit@mq2</p><p class="">rabbit@mq1.log: * Losers: [rabbit@mq1]</p><p class="">rabbit@mq1.log: Autoheal: we were selected to restart; winner is rabbit@mq2</p>
<p class="">rabbit@mq1.log: Stopping RabbitMQ</p><p class="">rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down</p><p class="">rabbit@mq2.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1</p>
<p class="">rabbit@mq2.log: Stopping RabbitMQ</p><p class="">rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down</p><p class="">rabbit@mq1.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq2</p>
<p class="">rabbit@mq1.log: Autoheal decision</p><p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p><p class="">rabbit@mq1.log: * Winner: rabbit@mq1</p><p class="">rabbit@mq1.log: * Losers: [rabbit@mq2]</p>
<p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal decision</p><p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p><p class="">rabbit@mq1.log: * Winner: rabbit@mq1</p>
<p class="">rabbit@mq1.log: * Losers: [rabbit@mq2]</p><p class="">rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop</p><p class="">rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop</p>
<p class="">rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down</p><p class="">rabbit@mq2.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1</p>
<p class="">rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down</p><p class="">rabbit@mq1.log: Autoheal request sent to rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq2</p><p class="">
rabbit@mq1.log: Autoheal decision</p><p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p><p class="">rabbit@mq1.log: * Winner: rabbit@mq1</p><p class="">rabbit@mq1.log: * Losers: [rabbit@mq2]</p>
<p class="">rabbit@mq1.log: Autoheal request received from rabbit@mq1</p><p class="">rabbit@mq1.log: Autoheal decision</p><p class="">rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]</p><p class="">rabbit@mq1.log: * Winner: rabbit@mq1</p>
<p class="">rabbit@mq1.log: * Losers: [rabbit@mq2]</p><p class="">rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop</p>
<p class="">rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop</p><p class=""># And nothing else beyond this, even after waiting for 10+ minutes.</p><p class="">I never see the "Stopping RabbitMQ" message that I've seen in other Autoheal circumstances.</p>
<p class="">I can send more complete logs, but wanted to see if this is a known issue or expected behavior first.</p><p class=""><br></p><p class="">Matt</p></div></div>