[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Mon May 20 11:30:01 BST 2013

On 17/05/13 20:38, Maslinski, Ray wrote:
> Hello,

Hi!

> To simulate a network partition failure, I’ve been using iptables to
> temporarily block inbound and outbound access on one of the nodes to the
> single port configured for cluster communications through
> inet_dist_listen_min and inet_dist_listen_max settings (min = max).
> Client access is not blocked during a simulated partition fault.

Sounds reasonable.
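
For anyone wanting to reproduce this, a minimal sketch of that kind of
fault injection follows. It only builds the iptables command lines; it
does not run them. The helper name partition_rules and the port value
9100 are placeholders for illustration, not taken from the setup
described above; substitute whatever single port you configured for
inet_dist_listen_min / inet_dist_listen_max.

```python
def partition_rules(port, action="-A"):
    """Build iptables commands that drop TCP traffic to `port`.

    One rule goes on INPUT (peers connecting to our distribution port)
    and one on OUTPUT (us connecting to the same port on a peer).
    action: "-A" appends the rules (start the fault),
            "-D" deletes them again (end the fault).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        ["iptables", action, chain, "-p", "tcp",
         "--dport", str(port), "-j", "DROP"]
        for chain in ("INPUT", "OUTPUT")
    ]


if __name__ == "__main__":
    # 9100 is an assumed placeholder port, not a RabbitMQ default.
    for cmd in partition_rules(9100):
        print(" ".join(cmd))
```

The printed lines would need to be run as root on the node being
isolated, and the same rules removed with action="-D" to end the
simulated partition. Client ports are untouched, which matches the
test described above.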

> I’ve observed two anomalies during testing that I wasn’t expecting based
> on the documentation I’ve read:
>
> -At a sufficiently high message rate, some number of messages will be
> lost during the fault sequence, with the number lost tending to increase
> with message rate.  No indication of a send error has been observed by
> the client program. Based on results obtained from test logs and an
> independent monitor listening on trace messages from each node, it
> appears that as soon as the port is blocked, both nodes continue to
> accept published messages, but (temporarily) stop delivering messages
> until the cluster heartbeat failure is detected, at which point the
> cluster is partitioned and the slave promotes itself to become master.
> In the sequences I’ve looked at, the messages that are lost all appear
> to be published to the original master (and final master after a winner
> is selected during autoheal).  Neither the start nor the end of the lost
> message window appears to line up with any events in the logs, other than
> the start occurring sometime after the port connection is blocked but
> before the cluster heartbeat failure is detected, and the end occurring
> sometime after the detection of the cluster heartbeat failure and before
> the detection of the partitioned cluster after the connection is
> unblocked.  Is message loss to be expected in this scenario?

I would expect to see message loss in a cluster heal scenario.

It's important to remember that a cluster partition is still a 
substantial problem, and the healing process involves throwing state 
away. Autoheal mode just means you get through this process faster, and 
hopefully spend much less time accepting messages that will end up being 
lost.

Intuitively, I would expect only messages from the losing partitions
to be lost. But I would not be entirely surprised if messages from the
winner were lost too: there is a period after the partitions have come
back together but before autoheal kicks in during which there are
multiple masters for a queue, and behaviour can be unpredictable.
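
If it helps to pin down where the lost window starts and ends relative
to the log events, one possible approach is to have the publisher stamp
each message with an increasing sequence number and compare what was
published with what was consumed. The sketch below assumes that scheme;
the function names and the sample numbers are hypothetical and do not
come from the test described above.

```python
def find_lost(published, received):
    """Return sequence numbers that were published but never received."""
    return sorted(set(published) - set(received))


def lost_windows(lost):
    """Group sorted sequence numbers into contiguous (first, last) windows."""
    windows = []
    for seq in lost:
        if windows and seq == windows[-1][1] + 1:
            # Extends the current window by one.
            windows[-1] = (windows[-1][0], seq)
        else:
            # Gap in the sequence: start a new window.
            windows.append((seq, seq))
    return windows


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    published = range(1, 11)
    received = [1, 2, 3, 7, 8, 10]
    print(lost_windows(find_lost(published, received)))
```

Logging a timestamp alongside each sequence number at publish time
would then let the first and last entry of each window be matched
against the partition and autoheal events in the broker logs.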

> -Occasionally the autoheal loser node fails to rejoin the cluster after
> restart.  I don’t have a lot of data points on this one since it’s only
> happened a handful of times during overnight test iterations.  During
> one failure, the autoheal winner showed the log message below during
> recovery:

Ah, that looks like a bug in autoheal. I think the stack trace you 
posted should contain enough information to fix it. Thanks.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal