[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Mon Jun 3 14:01:54 BST 2013

For this particular test, I was trying to gain some insight as to the behavior of the cluster while the partition was in effect, so the partition wasn't resolved until the test was completed or aborted.  My general expectation was that both sides of the partition would begin to process messages more or less normally once the cluster connection timed out and the slave promoted itself to master, possibly with some glitches such as dropped messages during the transition.  The slow trickle of messages delivered on one of the existing connections was unexpected, and seemed to be a bit of an anomaly when combined with the growing count of unacked messages shown on the console.

Ray Maslinski
Senior Software Developer, Engineering
Valassis / Digital Media
Cell: 585.330.2426
maslinskir at valassis.com
www.valassis.com

Creating the future of intelligent media delivery to drive your greatest success

_____________________________________________________________________________

This message may include proprietary or protected information. If you are not the intended 
recipient, please notify me, delete this message and do not further communicate the information 
contained herein without my express consent.

-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com] 
Sent: Monday, June 03, 2013 6:37 AM
To: Maslinski, Ray
Cc: Discussions about RabbitMQ
Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Just to be clear - at what point in this test are you unblocking the connection so the partition can heal?

Cheers, Simon

On 31/05/13 20:04, Maslinski, Ray wrote:
> The net_ticktime is using the default value, and event sequence seems 
> consistent with the associated timeout.
>
> Roughly,
>
> Start test
>
> Block cluster connection at some point -> trace data shows that only 
> publish events are occurring after this happens
>
> After a bit more than a minute, RabbitMQ server log shows lost cluster 
> connection on both nodes, along with the death of the mirrors and 
> promotion of slave to master.
>
> Test continues and completes publish phase, since these appeared to be 
> progressing normally, with no errors visible to client
>
> Test then starts to verify that all messages were received by polling 
> the contents of a buffer used to store incoming messages.  In a normal 
> scenario without faults, this generally means that the first poll 
> finds all the messages expected to be present and test completes 
> successfully.  Otherwise, the poll rechecks for any missing messages 
> once per second, and gives up if 30 seconds elapse without any new 
> messages showing up (timer is reset if any messages arrived during 
> polling window).
>
> The slow trickle of messages occurs during the last phase, as one 
> appears to arrive every 14 seconds and the polling loop continues to 
> retry.  Trace data doesn't appear to log any deliver events during 
> this stage, but console shows a slow increment in the number of 
> unacknowledged messages that lines up with the delivery rate observed 
> by the client.
>
> Ray Maslinski Senior Software Developer, Engineering Valassis / 
> Digital Media Cell: 585.330.2426 maslinskir at valassis.com 
> www.valassis.com
>
> Creating the future of intelligent media delivery to drive your 
> greatest success
>
> ______________________________________________________________________
> _______
>
>  This message may include proprietary or protected information. If you 
> are not the intended recipient, please notify me, delete this message 
> and do not further communicate the information contained herein 
> without my express consent.
>
>
> -----Original Message----- From: Simon MacMullen 
> [mailto:simon at rabbitmq.com] Sent: Friday, May 31, 2013 12:33 PM To:
> Maslinski, Ray Cc: Discussions about RabbitMQ Subject: Re:
> [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures 
> when recovering from cluster partition
>
> Hi. The behaviour you describe doesn't really match what I would 
> expect to see. In the beginning of a partition I would expect:
>
> 1) Partition starts
>
> 2) Things behave slowly for approximately net_ticktime (see
> http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes attempt 
> to contact each other and time out
>
> 3) On each node, Erlang decides the other nodes are down. Things speed 
> up again, HA queues fail over. Split-brain has begun.
>
> It sounds like you were stuck in 2) for an extended period of time.
> Have you changed net_ticktime?
>
> Alternatively, when testing partitions I have see really odd behaviour 
> by blocking network traffic in one direction only with iptables. You 
> might want to check if you've done that by mistake.
>
> Cheers, Simon
>
> On 30/05/13 22:14, Maslinski, Ray wrote:
>> Follow-up question ...
>>
>> I tried some experiments to gain some understanding of how the 
>> cluster behaved with clients attached during a network partition 
>> event. Essentially, I repeated the previous tests described below for 
>> autohealing and automatic queue synchronization, but left the cluster 
>> communications port blocked while the client test completed. One 
>> oddity I noticed was that while the consumer connected to the slave 
>> appeared to receive an indication that something was amiss (client 
>> log showed a consumer cancel exception being handled by the Spring 
>> AMQP framework, and other monitoring logs appeared to show the client 
>> restarting a connection, which seems to be consistent with 
>> documentation), the consumer connected to the master seemed to remain 
>> oblivious to any possible issues.
>> That consumer continued to receive messages, but at an extremely slow 
>> rate (test published at 16/sec fixed rate, but the remaining consumer 
>> began to receive messages at the rate of about 1 every 14 seconds).
>>
>> Since the test client waits around for expected message deliveries 
>> with a resettable 30 second timeout, it continued to run for an 
>> extended period of time (longer than I waited around for).  In 
>> addition, the admin console showed a relatively small number of 
>> unacked messages on that server, with the unacked count increasing 
>> with each actual delivery (client should always be acknowledging in 
>> the test setup, and reported no errors).  Eventually unblocking the 
>> cluster port released a bunch of messages in a short interval (albeit 
>> with some lost, as described previously).
>>
>> I also saw  producer connections go into flow control during the 
>> outage and remain there during the slow consumer delivery (though the 
>> test had long since completed delivering all its messages).
>>
>> Does this sound like expected behavior during a partition?
>>
>> Ray Maslinski Senior Software Developer, Engineering Valassis / 
>> Digital Media Cell: 585.330.2426 maslinskir at valassis.com 
>> www.valassis.com
>>
>> Creating the future of intelligent media delivery to drive your 
>> greatest success
>>
>> _____________________________________________________________________
>> _
>>
>>
_______
>>
>> This message may include proprietary or protected information. If you 
>> are not the intended recipient, please notify me, delete this message 
>> and do not further communicate the information contained herein 
>> without my express consent.
>>
>> -----Original Message----- From: Simon MacMullen 
>> [mailto:simon at rabbitmq.com] Sent: Monday, May 20, 2013 6:30 AM To:
>> Discussions about RabbitMQ Cc: Maslinski, Ray Subject: Re:
>> [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures 
>> when recovering from cluster partition
>>
>> On 17/05/13 20:38, Maslinski, Ray wrote:
>>> Hello,
>>
>> Hi!
>>
>>> To simulate a network partition failure, I've been using iptables to 
>>> temporarily block inbound and outbound access on one of the nodes to 
>>> the single port configured for cluster communications through 
>>> inet_dist_listen_min and inet_dist_listen_max settings (min = max). 
>>> Client access is not blocked during a simulated partition fault.
>>
>> Sounds reasonable.
>>
>>> I've observed two anomalies during testing that I wasn't expecting 
>>> based on the documentation I've read:
>>>
>>> -At a sufficiently high message rate, some number of messages will 
>>> be lost during the fault sequence, with the number lost tending to 
>>> increase with message rate.  No indication of a send error has been 
>>> observed by the client program. Based on results obtained from test 
>>> logs and an independent monitor listening on trace messages from 
>>> each node, it appears that as soon as the port is blocked, both 
>>> nodes continue to accept published messages, but (temporarily) stop 
>>> delivering messages until the cluster heartbeat failure is detected, 
>>> at which point the cluster is partitioned and the slave promotes 
>>> itself to become master. In the sequences I've looked at, the 
>>> messages that are lost all appear to be published to the original 
>>> master (and final master after a winner is selected during 
>>> autoheal). Neither the start nor the end of the lost message window 
>>> appear to line up with any events in the logs, other than the start 
>>> occurring sometime after the port connection is blocked but before 
>>> the cluster heartbeat failure is detected, and the end occurring 
>>> sometime after the detection of the cluster heartbeat failure and 
>>> before the detection of the partitioned cluster after the connection 
>>> is unblocked.  Is message loss to be expected in this scenario?
>>
>> I would expect to see message loss in a cluster heal scenario.
>>
>> It's important to remember that a cluster partition is still a 
>> substantial problem, and the healing process involves throwing state 
>> away. Autoheal mode just means you get through this process faster, 
>> and hopefully spend much less time accepting messages that will end 
>> up being lost.
>>
>> I would expect intuitively that only messages from the losing 
>> partitions would be lost. But I am not entirely surprised if messages 
>> from the winner are lost too; there is a period after the partitions 
>> have come back together but before autoheal kicks in during which we 
>> will have multiple masters for a queue, and behaviour can be 
>> unpredictable.
>>
>>> -Occasionally the autoheal loser node fails to rejoin the cluster 
>>> after restart.  I don't have a lot of data points on this one since 
>>> it's only happened a handful of times during overnight test 
>>> iterations.  During one failure, the autoheal winner showed the log 
>>> message below during recovery:
>>
>> Ah, that looks like a bug in autoheal. I think the stack trace you 
>> posted should contain enough information to fix it. Thanks.
>>
>> Cheers, Simon
>>
>> -- Simon MacMullen RabbitMQ, Pivotal
>>
>
>
> -- Simon MacMullen RabbitMQ, Pivotal
>

--
Simon MacMullen
RabbitMQ, Pivotal