[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Maslinski, Ray MaslinskiR at valassis.com
Fri May 31 20:04:02 BST 2013


The net_ticktime setting is at its default value, and the event sequence seems consistent with the associated timeout (with the 60-second default, detection should take roughly 60-75 seconds, which matches the "bit more than a minute" below).
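
For reference, net_ticktime is not overridden anywhere; aside from pinning the distribution port, the kernel section of our rabbitmq.config looks something like the following (the port number here is illustrative, not our actual value):

    [
      {kernel, [{net_ticktime,         60},    %% the default, shown explicitly
                {inet_dist_listen_min, 25672}, %% cluster port pinned (min = max)
                {inet_dist_listen_max, 25672}]}
    ].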

Roughly,

Start test

Block the cluster connection at some point -> trace data shows that only publish events occur after this happens

After a bit more than a minute, the RabbitMQ server logs on both nodes show the lost cluster connection, along with the death of the mirrors and the promotion of the slave to master.

The test continues and completes its publish phase, since publishes appeared to be progressing normally, with no errors visible to the client.

The test then starts to verify that all messages were received by polling the contents of a buffer used to store incoming messages.  In a normal run without faults, the first poll generally finds all of the expected messages and the test completes successfully.  Otherwise, the poll rechecks for any missing messages once per second and gives up if 30 seconds elapse without any new messages showing up (the timer is reset if any messages arrive during the polling window).
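
For clarity, the verification loop is roughly the following (an illustrative Java sketch; receivedBuffer, expectedIds, and snapshotMessageIds are stand-ins for the real test harness, not actual code from it):

    // imports: java.util.HashSet, java.util.Set; InterruptedException handling elided
    Set<String> missing = new HashSet<String>(expectedIds);
    long lastProgress = System.currentTimeMillis();
    while (!missing.isEmpty()) {
        int before = missing.size();
        missing.removeAll(receivedBuffer.snapshotMessageIds()); // poll the buffer
        if (missing.size() < before) {
            lastProgress = System.currentTimeMillis();          // arrivals reset the timer
        } else if (System.currentTimeMillis() - lastProgress > 30000) {
            throw new AssertionError(missing.size() + " messages never arrived");
        }
        Thread.sleep(1000);                                     // recheck once per second
    }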

The slow trickle of messages occurs during this last phase: one message appears to arrive every 14 seconds while the polling loop continues to retry.  The trace data doesn't log any deliver events during this stage, but the console shows a slow increment in the number of unacknowledged messages that lines up with the delivery rate observed by the client.

Ray Maslinski
Senior Software Developer, Engineering
Valassis / Digital Media
Cell: 585.330.2426
maslinskir at valassis.com
www.valassis.com



-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com] 
Sent: Friday, May 31, 2013 12:33 PM
To: Maslinski, Ray
Cc: Discussions about RabbitMQ
Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Hi. The behaviour you describe doesn't really match what I would expect to see. At the beginning of a partition I would expect:

1) Partition starts

2) Things behave slowly for approximately net_ticktime (see
http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes attempt to contact each other and time out

3) On each node, Erlang decides the other nodes are down. Things speed up again, HA queues fail over. Split-brain has begun.

It sounds like you were stuck in 2) for an extended period of time. Have you changed net_ticktime?

Alternatively, when testing partitions I have seen really odd behaviour caused by blocking network traffic in one direction only with iptables. You might want to check whether you've done that by mistake.
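
For completeness, a symmetric block on the node under test would look something like this (assuming you have pinned the distribution port to, say, 25672 via inet_dist_listen_min/max -- substitute whatever port you actually configured):

    # drop cluster (Erlang distribution) traffic in both directions
    iptables -A INPUT  -p tcp --dport 25672 -j DROP
    iptables -A INPUT  -p tcp --sport 25672 -j DROP
    iptables -A OUTPUT -p tcp --dport 25672 -j DROP
    iptables -A OUTPUT -p tcp --sport 25672 -j DROP

    # re-run with -D in place of -A to delete the rules and heal the partition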

Cheers, Simon

On 30/05/13 22:14, Maslinski, Ray wrote:
> Follow-up question ...
>
> I tried some experiments to gain some understanding of how the cluster 
> behaved with clients attached during a network partition event.  
> Essentially, I repeated the previous tests described below for 
> autohealing and automatic queue synchronization, but left the cluster 
> communications port blocked while the client test completed.
> One oddity I noticed was that while the consumer connected to the 
> slave appeared to receive an indication that something was amiss 
> (client log showed a consumer cancel exception being handled by the 
> Spring AMQP framework, and other monitoring logs appeared to show the 
> client restarting a connection, which seems to be consistent with 
> documentation), the consumer connected to the master seemed to remain 
> oblivious to any possible issues.  That consumer continued to receive 
> messages, but at an extremely slow rate (test published at 16/sec 
> fixed rate, but the remaining consumer began to receive messages at 
> the rate of about 1 every 14 seconds).
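>
> (With the raw Java client the same condition surfaces through the
> handleCancel callback; the following is an illustrative sketch, not our
> actual Spring configuration:
>
>     // uses com.rabbitmq.client.{Channel, Consumer, DefaultConsumer}
>     Consumer consumer = new DefaultConsumer(channel) {
>         @Override
>         public void handleCancel(String consumerTag) throws IOException {
>             // broker-initiated cancel, e.g. mirrored queue failover;
>             // recover by re-subscribing or reconnecting
>         }
>     };
>     channel.basicConsume("test.queue", false, consumer);
>
> which appears to be what the Spring AMQP framework is reacting to.)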
>
> Since the test client waits around for expected message deliveries 
> with a resettable 30 second timeout, it continued to run for an 
> extended period of time (longer than I waited around for).  In 
> addition, the admin console showed a relatively small number of 
> unacked messages on that server, with the unacked count increasing 
> with each actual delivery (client should always be acknowledging in 
> the test setup, and reported no errors).  Eventually unblocking the 
> cluster port released a bunch of messages in a short interval (albeit 
> with some lost, as described previously).
>
> I also saw producer connections go into flow control during the 
> outage and remain there during the slow consumer delivery (though the 
> test had long since completed delivering all its messages).
>
> Does this sound like expected behavior during a partition?
>
> -----Original Message-----
> From: Simon MacMullen [mailto:simon at rabbitmq.com]
> Sent: Monday, May 20, 2013 6:30 AM
> To: Discussions about RabbitMQ
> Cc: Maslinski, Ray
> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>
> On 17/05/13 20:38, Maslinski, Ray wrote:
>> Hello,
>
> Hi!
>
>> To simulate a network partition failure, I've been using iptables to 
>> temporarily block inbound and outbound access on one of the nodes to 
>> the single port configured for cluster communications through 
>> inet_dist_listen_min and inet_dist_listen_max settings (min = max). 
>> Client access is not blocked during a simulated partition fault.
>
> Sounds reasonable.
>
>> I've observed two anomalies during testing that I wasn't expecting 
>> based on the documentation I've read:
>>
>> -At a sufficiently high message rate, some number of messages will be 
>> lost during the fault sequence, with the number lost tending to 
>> increase with message rate.  No indication of a send error has been 
>> observed by the client program. Based on results obtained from test 
>> logs and an independent monitor listening on trace messages from each 
>> node, it appears that as soon as the port is blocked, both nodes 
>> continue to accept published messages, but (temporarily) stop 
>> delivering messages until the cluster heartbeat failure is detected, 
>> at which point the cluster is partitioned and the slave promotes 
>> itself to become master. In the sequences I've looked at, the 
>> messages that are lost all appear to be published to the original 
>> master (and final master after a winner is selected during autoheal).  
>> Neither the start nor the end of the lost message window appears to 
>> line up with any events in the logs, other than the start occurring 
>> sometime after the port connection is blocked but before the cluster 
>> heartbeat failure is detected, and the end occurring sometime after 
>> the detection of the cluster heartbeat failure and before the 
>> detection of the partitioned cluster after the connection is 
>> unblocked.  Is message loss to be expected in this scenario?
>
> I would expect to see message loss in a cluster heal scenario.
>
> It's important to remember that a cluster partition is still a 
> substantial problem, and the healing process involves throwing state 
> away. Autoheal mode just means you get through this process faster, 
> and hopefully spend much less time accepting messages that will end up 
> being lost.
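>
> (For reference, that mode is selected via cluster_partition_handling in
> the rabbit section of rabbitmq.config:
>
>     [{rabbit, [{cluster_partition_handling, autoheal}]}].
>
> where the default is 'ignore'.)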
>
> I would expect intuitively that only messages from the losing 
> partitions would be lost. But I am not entirely surprised if messages 
> from the winner are lost too; there is a period after the partitions 
> have come back together but before autoheal kicks in during which we 
> will have multiple masters for a queue, and behaviour can be 
> unpredictable.
>
>> -Occasionally the autoheal loser node fails to rejoin the cluster 
>> after restart.  I don't have a lot of data points on this one since 
>> it's only happened a handful of times during overnight test 
>> iterations.  During one failure, the autoheal winner showed the log 
>> message below during recovery:
>
> Ah, that looks like a bug in autoheal. I think the stack trace you 
> posted should contain enough information to fix it. Thanks.
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>


--
Simon MacMullen
RabbitMQ, Pivotal

