[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Thu May 30 22:14:33 BST 2013

Follow-up question ...

I tried some experiments to understand how the cluster behaves with clients attached during a network partition event.  Essentially, I repeated the previous tests described below for autohealing and automatic queue synchronization, but left the cluster communications port blocked while the client test completed.  One oddity I noticed concerned the two consumers.  The consumer connected to the slave appeared to receive an indication that something was amiss: the client log showed a consumer cancel exception being handled by the Spring AMQP framework, and other monitoring logs appeared to show the client restarting a connection, which seems consistent with the documentation.  The consumer connected to the master, however, seemed to remain oblivious to any problem.  That consumer continued to receive messages, but at an extremely slow rate: the test published at a fixed rate of 16/sec, but the remaining consumer received messages at about 1 every 14 seconds.

Since the test client waits for expected message deliveries with a resettable 30-second timeout, it continued to run for an extended period of time (longer than I waited around for).  In addition, the admin console showed a relatively small number of unacked messages on that server, with the unacked count increasing with each actual delivery (the client should always be acknowledging in the test setup, and it reported no errors).  When I eventually unblocked the cluster port, a bunch of messages were released in a short interval (albeit with some lost, as described previously).
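
To illustrate why the test never timed out on its own, here is a minimal sketch of a resettable timeout fed by one delivery every 14 seconds. The function name and the figure of 100 deliveries are made up for illustration; only the 16/sec publish rate, the 14-second delivery interval and the 30-second timeout come from the test described above.

```python
def resettable_timeout_fires_at(delivery_times, timeout_s=30.0):
    """Return the time (seconds) at which a resettable timeout fires.

    delivery_times: ascending arrival times of messages, in seconds.
    The timer starts at t = 0 and is restarted by every arrival that
    reaches the client before the current deadline.
    """
    deadline = timeout_s
    for t in delivery_times:
        if t > deadline:
            break  # the timer fired before this message arrived
        deadline = t + timeout_s  # arrival restarts the timer
    return deadline


PUBLISH_RATE_PER_S = 16.0   # fixed publish rate used by the test
SLOW_INTERVAL_S = 14.0      # observed gap between deliveries

# Slowdown factor of delivery relative to publishing: 16 * 14 = 224.
slowdown = PUBLISH_RATE_PER_S * SLOW_INTERVAL_S

# Hypothetical run: 100 deliveries, one every 14 seconds.
slow_deliveries = [SLOW_INTERVAL_S * i for i in range(1, 101)]
fires_at = resettable_timeout_fires_at(slow_deliveries)

print(f"slowdown factor: {slowdown:.0f}x")
print(f"timeout fires at t = {fires_at:.0f} s")
```

Because 14 seconds is less than the 30-second timeout, every delivery restarts the timer, so the client only gives up 30 seconds after the last delivery it receives.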

I also saw producer connections go into flow control during the outage and remain there during the slow consumer delivery (though the test had long since completed delivering all its messages).

Does this sound like expected behavior during a partition?

Ray Maslinski
Senior Software Developer, Engineering
Valassis / Digital Media
Cell: 585.330.2426
maslinskir at valassis.com
www.valassis.com

Creating the future of intelligent media delivery to drive your greatest success

-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com] 
Sent: Monday, May 20, 2013 6:30 AM
To: Discussions about RabbitMQ
Cc: Maslinski, Ray
Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

On 17/05/13 20:38, Maslinski, Ray wrote:
> Hello,

Hi!

> To simulate a network partition failure, I've been using iptables to 
> temporarily block inbound and outbound access on one of the nodes to 
> the single port configured for cluster communications through 
> inet_dist_listen_min and inet_dist_listen_max settings (min = max).
> Client access is not blocked during a simulated partition fault.

Sounds reasonable.
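
For anyone wanting to reproduce this kind of fault injection, the following is a rough sketch of how the iptables rules might be generated. The exact rules used in the original test were not posted, so the choice of chains and matches here is an assumption, as is the port number 25672 (substitute whatever value inet_dist_listen_min and inet_dist_listen_max are set to). The helper only builds the command lines; it does not run them.

```python
def partition_rules(port, action="block"):
    """Build iptables command lines that drop cluster traffic on one node.

    port:   the single distribution port (inet_dist_listen_min == max).
    action: "block" appends DROP rules (-A), "unblock" deletes them (-D).
    Returns a list of argv lists; nothing is executed here.
    """
    flag = {"block": "-A", "unblock": "-D"}[action]
    rules = []
    for chain in ("INPUT", "OUTPUT"):
        # Match the port as destination and as source, so that connections
        # opened by either node are covered.
        for match in ("--dport", "--sport"):
            rules.append(
                ["iptables", flag, chain, "-p", "tcp",
                 match, str(port), "-j", "DROP"]
            )
    return rules


DIST_PORT = 25672  # hypothetical value, not taken from the original test

for argv in partition_rules(DIST_PORT, "block"):
    print(" ".join(argv))
```

Calling partition_rules(DIST_PORT, "unblock") yields the matching -D commands to end the simulated partition. Client ports are left untouched, as in the test described above.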

> I've observed two anomalies during testing that I wasn't expecting 
> based on the documentation I've read:
>
> -At a sufficiently high message rate, some number of messages will be 
> lost during the fault sequence, with the number lost tending to 
> increase with message rate.  No indication of a send error has been 
> observed by the client program. Based on results obtained from test 
> logs and an independent monitor listening on trace messages from each 
> node, it appears that as soon as the port is blocked, both nodes 
> continue to accept published messages, but (temporarily) stop 
> delivering messages until the cluster heartbeat failure is detected, 
> at which point the cluster is partitioned and the slave promotes itself to become master.
> In the sequences I've looked at, the messages that are lost all appear 
> to be published to the original master (and final master after a 
> winner is selected during autoheal).  Neither the start nor the end of 
> the lost message window appear to line up with any events in the logs, 
> other than the start occurring sometime after the port connection is 
> blocked but before the cluster heartbeat failure is detected, and the 
> end occurring sometime after the detection of the cluster heartbeat 
> failure and before the detection of the partitioned cluster after the 
> connection is unblocked.  Is message loss to be expected in this scenario?

I would expect to see message loss in a cluster heal scenario.

It's important to remember that a cluster partition is still a substantial problem, and the healing process involves throwing state away. Autoheal mode just means you get through this process faster, and hopefully spend much less time accepting messages that will end up being lost.

I would expect intuitively that only messages from the losing partitions would be lost. But I am not entirely surprised if messages from the winner are lost too; there is a period after the partitions have come back together but before autoheal kicks in during which we will have multiple masters for a queue, and behaviour can be unpredictable.
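
As a toy illustration of the intuitive expectation only (this is not how RabbitMQ implements healing, and the node names are made up), the sketch below lets both sides of a partition accept publishes and then discards everything held by the losing side. It deliberately does not model the multiple-master window, so it cannot reproduce losses on the winner.

```python
def heal(published, winner):
    """Toy model of healing a partition by throwing the loser's state away.

    published: list of (node_name, message) pairs accepted while the
               cluster was partitioned.
    winner:    name of the node whose state is kept.
    Returns (surviving, lost) message lists.
    """
    surviving, lost = [], []
    for node_name, message in published:
        if node_name == winner:
            surviving.append(message)
        else:
            lost.append(message)  # loser restarts and discards its state
    return surviving, lost


# Hypothetical publishes accepted by each side during the partition.
accepted = [
    ("rabbit@a", "m1"),
    ("rabbit@b", "m2"),
    ("rabbit@a", "m3"),
    ("rabbit@b", "m4"),
]

surviving, lost = heal(accepted, winner="rabbit@a")
print("surviving:", surviving)
print("lost:", lost)
```

The longer the partition lasts, the more publishes the losing side accepts and the larger the lost list grows, which is why a faster heal reduces the loss.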

> -Occasionally the autoheal loser node fails to rejoin the cluster 
> after restart.  I don't have a lot of data points on this one since 
> it's only happened a handful of times during overnight test 
> iterations.  During one failure, the autoheal winner showed the log 
> message below during recovery:

Ah, that looks like a bug in autoheal. I think the stack trace you posted should contain enough information to fix it. Thanks.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, Pivotal