[rabbitmq-discuss] Autoheal torture test - Initial success, then a terminal state

Tim Watson tim at rabbitmq.com
Fri Mar 28 10:03:21 GMT 2014


I don't suppose you could post the code you're using to trigger this, could you?

Cheers,
Tim

On 27 Mar 2014, at 22:15, Matt Pietrek wrote:

> Because we're sometimes just mean to our software, I wrote a torture test to see how RabbitMQ's Autoheal deals with repeated partitions.
> 
> In a nutshell, we start with two brokers (3.2.4) in a cluster. I run my test, which uses "iptables" to knock out the link between the two brokers and then restore it.
> 
> It does this break/fix continuously in a loop. The time between partitions and the time spent inside each partition are both configurable.
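
A minimal sketch of what such a break/fix loop could look like, since the original test code is not shown in this thread. Everything here is an assumption: the function names, the peer address, and the choice to drop all traffic with plain INPUT/OUTPUT DROP rules rather than whatever rules the real test uses. The command runner is passed in so the loop can be dry-run without root.

```python
import time
from typing import Callable, List


def partition_cmds(peer_ip: str, heal: bool) -> List[List[str]]:
    """Build iptables commands that drop (heal=False) or stop dropping
    (heal=True) all traffic to and from peer_ip. Hypothetical rule set."""
    action = "-D" if heal else "-A"  # -A appends a rule, -D deletes it again
    return [
        ["iptables", action, "INPUT", "-s", peer_ip, "-j", "DROP"],
        ["iptables", action, "OUTPUT", "-d", peer_ip, "-j", "DROP"],
    ]


def torture(
    peer_ip: str,
    cycles: int,
    healthy_secs: float,
    partitioned_secs: float,
    run: Callable[[List[str]], object],
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Repeatedly break and restore the link to peer_ip."""
    for _ in range(cycles):
        sleep(healthy_secs)            # time between partitions
        for cmd in partition_cmds(peer_ip, heal=False):
            run(cmd)                   # induce the partition
        sleep(partitioned_secs)        # time spent partitioned
        for cmd in partition_cmds(peer_ip, heal=True):
            run(cmd)                   # restore the link
```

On one of the brokers, and as root, `run` could be something like `lambda cmd: subprocess.run(cmd, check=True)` with both intervals set to 60 to match the timings described in the post.
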
> 
> With 60 seconds between partitions and 60 seconds spent in each partitioned state, I expect things to get messy while the test runs: the brokers try to autoheal, and then everything falls apart again. However, once I stop the torture and return things to "normal", I'd expect an autoheal to eventually succeed and the brokers to be happily clustered again.
> 
> This isn't what happens. Instead, the two brokers essentially ignore each other, even after I wait 10+ minutes. I can see each broker, but each thinks the other is missing.
> 
> Here's a filtered view of the logs, grepping for "Autoheal|Starting|Stopping|Partitions|Winner|Loser":
> 
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: I am the winner, waiting for [rabbit@mq1] to stop
> rabbit@mq2.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq1] to stop
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq2
> rabbit@mq1.log:  * Losers:     [rabbit@mq1]
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq2
> rabbit@mq1.log:  * Losers:     [rabbit@mq1]
> rabbit@mq1.log: Autoheal: we were selected to restart; winner is rabbit@mq2
> rabbit@mq1.log: Stopping RabbitMQ
> rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1
> rabbit@mq2.log: Stopping RabbitMQ
> rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq1
> rabbit@mq1.log:  * Losers:     [rabbit@mq2]
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq1
> rabbit@mq1.log:  * Losers:     [rabbit@mq2]
> rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop
> rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop
> rabbit@mq2.log: Autoheal: aborting - rabbit@mq1 went down
> rabbit@mq2.log: Autoheal request sent to rabbit@mq1
> rabbit@mq2.log: Autoheal: we were selected to restart; winner is rabbit@mq1
> rabbit@mq1.log: Autoheal: aborting - rabbit@mq2 went down
> rabbit@mq1.log: Autoheal request sent to rabbit@mq1
> rabbit@mq1.log: Autoheal request received from rabbit@mq2
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq1
> rabbit@mq1.log:  * Losers:     [rabbit@mq2]
> rabbit@mq1.log: Autoheal request received from rabbit@mq1
> rabbit@mq1.log: Autoheal decision
> rabbit@mq1.log:  * Partitions: [[rabbit@mq1],[rabbit@mq2]]
> rabbit@mq1.log:  * Winner:     rabbit@mq1
> rabbit@mq1.log:  * Losers:     [rabbit@mq2]
> rabbit@mq1.log: Autoheal: I am the winner, waiting for [rabbit@mq2] to stop
> rabbit@mq1.log: Autoheal: I am the winner, waiting additionally for [rabbit@mq2] to stop
> 
> # And nothing else beyond this, even after waiting for 10+ minutes.
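
For anyone trying to reproduce the filtered view above, the same keyword filter can be sketched in Python. The pattern is the one quoted in the post; the function name and the "file: line" output shape are assumptions chosen to resemble the listing, not the exact command that was run.

```python
import re
from typing import Iterable, List

# The alternation quoted in the post, unchanged.
PATTERN = re.compile(r"Autoheal|Starting|Stopping|Partitions|Winner|Loser")


def filter_log(lines: Iterable[str], source: str) -> List[str]:
    """Keep only lines mentioning one of the keywords, prefixed with the
    name of the log file they came from (hypothetical helper)."""
    return [
        f"{source}: {line.rstrip()}"
        for line in lines
        if PATTERN.search(line)
    ]
```

Calling it once per broker log, for example `filter_log(open(path), "rabbit@mq1.log")` with a real log path, gives one list per node.
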
> 
> I don't ever see the "Stopping RabbitMQ" that I've seen in other Autoheal circumstances.
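
The symptom described here (a winner left waiting for a loser that never logs "Stopping RabbitMQ") can be checked mechanically on the filtered output. The following is a rough sketch under stated assumptions: it only pattern-matches the message texts quoted in this thread, it treats the lines as being in chronological order, and it says nothing about RabbitMQ's internal autoheal state. The function name is hypothetical. Node names are accepted both as `rabbit@mq1` and in the archive's `rabbit at mq1` form.

```python
import re
from typing import Dict, Iterable, Set

LINE = re.compile(r"^(?P<node>\S+)\.log:\s*(?P<msg>.*)$")
WAITING = re.compile(
    r"I am the winner, waiting (?:additionally )?for \[(.*?)\] to stop"
)


def stuck_winners(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return {winner: losers it is still waiting on} at the end of the log."""
    waiting: Dict[str, Set[str]] = {}
    for raw in lines:
        # Drop quote markers and undo the archive's "@" -> " at " rewriting.
        text = raw.lstrip("> ").replace(" at ", "@").strip()
        m = LINE.match(text)
        if not m:
            continue
        node, msg = m.group("node"), m.group("msg")
        w = WAITING.search(msg)
        if w:
            losers = {n.strip() for n in w.group(1).split(",") if n.strip()}
            waiting.setdefault(node, set()).update(losers)
        elif msg.startswith("Autoheal: aborting"):
            waiting.pop(node, None)        # this node gave up its own wait
        elif msg.startswith("Stopping RabbitMQ"):
            for pending in waiting.values():
                pending.discard(node)      # a loser actually stopped
    return {winner: left for winner, left in waiting.items() if left}
```

Run against the listing above, this would be expected to report rabbit@mq1 still waiting on rabbit@mq2, which matches what the logs show by eye.
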
> 
> I can send more complete logs, but wanted to see if this is a known issue or expected behavior first.
> 
> 
> 
> Matt
> 
