[rabbitmq-discuss] Autoheal torture test - Initial success, then a terminal state
Tim Watson
tim at rabbitmq.com
Fri Mar 28 10:03:21 GMT 2014
I don't suppose you could post the code that you're using to trigger this, could you?
Cheers,
Tim
On 27 Mar 2014, at 22:15, Matt Pietrek wrote:
> Because we're sometimes just mean to our software, I wrote a torture test to see how RabbitMQ's Autoheal deals with repeated partitions.
>
> In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my test which uses "iptables" to knock out the link between the two brokers and then restore things.
>
> It does this break/fix continuously in a loop. The time between partitions and the time spent inside each partition are both configurable.
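
For anyone wanting to reproduce this, the break/fix loop described above could be sketched roughly as follows. This is not the script used in the report: the function names, the peer address, and the choice of an INPUT-chain DROP rule are all assumptions made for illustration, and running it for real needs root on one of the brokers.

```python
import subprocess
import time
from typing import Callable, List


def iptables_rule(action: str, peer_ip: str) -> List[str]:
    """Build an iptables command that appends (-A) or deletes (-D) a rule
    dropping all inbound packets from peer_ip (hypothetical rule choice)."""
    return ["iptables", action, "INPUT", "-s", peer_ip, "-j", "DROP"]


def torture(
    peer_ip: str,
    cycles: int,
    seconds_between: float = 60.0,
    seconds_partitioned: float = 60.0,
    run: Callable[[List[str]], object] = subprocess.check_call,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Repeatedly cut and then restore the link to peer_ip."""
    for _ in range(cycles):
        sleep(seconds_between)
        run(iptables_rule("-A", peer_ip))  # induce the partition
        try:
            sleep(seconds_partitioned)
        finally:
            # Always remove the rule so the loop never leaves the link down.
            run(iptables_rule("-D", peer_ip))
```

Calling `torture("10.0.0.2", cycles=10)` on one broker (the address is a placeholder) would give the 60 seconds between / 60 seconds inside pattern mentioned in the report. The `run` and `sleep` parameters are injectable so the loop can be dry-run without touching the firewall.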
>
> Using 60 seconds between partitions and 60 seconds in a partitioned state, I expect that this might be messy: the brokers try to autoheal, and then everything falls apart. However, I'd expect that once I stop my torture test and return things to "normal", an autoheal will eventually succeed and the brokers will be happily clustered again.
>
> This isn't what happens. Instead, the two brokers essentially ignore each other, even after waiting for 10+ minutes. I can see each broker, but they each think the other is missing.
>
> Here's a filtered view of the logs, grepping for "Autoheal|Starting|Stopping|Partitions|Winner|Loser":
>
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: I am the winner, waiting for [rabbit@mq1] to stop
> rabbit@mq2.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq1] to stop
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq2
> rabbit@mq1.log: * Losers: [rabbit@mq1]
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq2
> rabbit@mq1.log: * Losers: [rabbit@mq1]
> rabbit@mq1.log: Autoheal: we were selected to restart; winner is rabbit@mq2
> rabbit@mq1.log: Stopping RabbitMQ
> rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1
> rabbit@mq2.log: Stopping RabbitMQ
> rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq1
> rabbit@mq1.log: * Losers: [rabbit@mq2]
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq1
> rabbit@mq1.log: * Losers: [rabbit@mq2]
> rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop
> rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop
> rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1
> rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq1
> rabbit@mq1.log: * Losers: [rabbit@mq2]
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log: * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log: * Winner: rabbit@mq1
> rabbit@mq1.log: * Losers: [rabbit@mq2]
> rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop
> rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop
>
> # And nothing else beyond this, even after waiting for 10+ minutes.
>
> I don't ever see the "Stopping RabbitMQ" that I've seen in other Autoheal circumstances.
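
The grep filter quoted above, plus a crude check for the "winner left waiting" end state, could be sketched like this. The function names are hypothetical and the stuck-state test is only a heuristic over the filtered lines, not anything RabbitMQ itself provides.

```python
import re
from typing import Iterable, List

# Same alternation as the grep expression quoted in the report.
AUTOHEAL_PATTERN = re.compile(r"Autoheal|Starting|Stopping|Partitions|Winner|Loser")


def filter_autoheal(lines: Iterable[str]) -> List[str]:
    """Keep only the log lines that match the autoheal-related keywords."""
    return [line.rstrip("\n") for line in lines if AUTOHEAL_PATTERN.search(line)]


def winner_left_waiting(lines: Iterable[str]) -> bool:
    """Heuristic: True if the last autoheal-related line shows a winner
    that is still waiting for the other node to stop."""
    filtered = filter_autoheal(lines)
    return bool(filtered) and "I am the winner, waiting" in filtered[-1]
```

Feeding it the lines of a broker log (for example via `open(path)`) gives the same kind of filtered view as above, and `winner_left_waiting` returns True when that view ends the way the one in the report does.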
>
> I can send more complete logs, but wanted to see if this is a known issue or expected behavior first.
>
>
>
> Matt
>