[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Fri May 31 17:32:57 BST 2013

Hi. The behaviour you describe doesn't really match what I would expect 
to see. In the beginning of a partition I would expect:

1) Partition starts

2) Things behave slowly for approximately net_ticktime (see 
http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes attempt 
to contact each other and time out

3) On each node, Erlang decides the other nodes are down. Things speed 
up again, HA queues fail over. Split-brain has begun.

It sounds like you were stuck in 2) for an extended period of time. Have 
you changed net_ticktime?
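
As a rough sanity check on step 2), the sketch below estimates how long the slow period should last for a given net_ticktime. It assumes the behaviour described in the Erlang kernel documentation: a silent node is declared down somewhere between 0.75 and 1.25 times net_ticktime, and the default is 60 seconds. The function names are hypothetical, chosen for illustration only; they are not part of RabbitMQ or Erlang.

```python
from typing import Tuple


def tick_detection_window(net_ticktime: float = 60.0) -> Tuple[float, float]:
    """Return (earliest, latest) seconds before a silent node is declared down.

    Assumption: detection happens within net_ticktime +/- 25%, as documented
    for the Erlang kernel application.
    """
    if net_ticktime <= 0:
        raise ValueError("net_ticktime must be positive")
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


def looks_stuck(observed_slow_seconds: float, net_ticktime: float = 60.0) -> bool:
    """True if the slow period outlasted the latest expected detection time."""
    _, latest = tick_detection_window(net_ticktime)
    return observed_slow_seconds > latest
```

With the default of 60 seconds the window is 45 to 75 seconds, so a slow period lasting many minutes suggests either a changed net_ticktime or a partition that the nodes never fully detected.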

Alternatively, when testing partitions I have seen really odd behaviour 
after blocking network traffic in one direction only with iptables. You 
might want to check whether you've done that by mistake.
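
One way to check for a one-directional block is to write out the full rule set a symmetric block needs and compare it with what is actually installed. The sketch below only builds iptables command strings and does not run them. It assumes a single TCP distribution port (min = max), and that a rule matching only the destination port can let reply packets on an already established connection through, so it matches the source port as well. The helper name and the port in the usage note are hypothetical, and whether this matches the rules used in the test is not known.

```python
from typing import List


def partition_rules(dist_port: int, action: str = "-A") -> List[str]:
    """Build iptables commands that drop cluster traffic in both directions.

    action is "-A" to append the rules or "-D" to delete them again.
    Matching both --dport and --sport covers packets on connections this
    node accepted as well as connections it opened itself.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be -A or -D")
    rules = []
    for chain in ("INPUT", "OUTPUT"):
        for port_flag in ("--dport", "--sport"):
            rules.append(
                f"iptables {action} {chain} -p tcp {port_flag} {dist_port} -j DROP"
            )
    return rules
```

For example, `partition_rules(25672)` yields four commands (two chains, two port matches each). If the installed rules cover only a subset of these, traffic may be getting through in one direction.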

Cheers, Simon

On 30/05/13 22:14, Maslinski, Ray wrote:
> Follow-up question ...
>
> I tried some experiments to gain some understanding of how the
> cluster behaved with clients attached during a network partition
> event.  Essentially, I repeated the previous tests described below
> for autohealing and automatic queue synchronization, but left the
> cluster communications port blocked while the client test completed.
> One oddity I noticed was that while the consumer connected to the
> slave appeared to receive an indication that something was amiss
> (client log showed a consumer cancel exception being handled by the
> Spring AMQP framework, and other monitoring logs appeared to show the
> client restarting a connection, which seems to be consistent with
> documentation), the consumer connected to the master seemed to remain
> oblivious to any possible issues.  That consumer continued to receive
> messages, but at an extremely slow rate (test published at 16/sec
> fixed rate, but the remaining consumer began to receive messages at
> the rate of about 1 every 14 seconds).
>
> Since the test client waits around for expected message deliveries
> with a resettable 30 second timeout, it continued to run for an
> extended period of time (longer than I waited around for).  In
> addition, the admin console showed a relatively small number of
> unacked messages on that server, with the unacked count increasing
> with each actual delivery (client should always be acknowledging in
> the test setup, and reported no errors).  Eventually unblocking the
> cluster port released a bunch of messages in a short interval (albeit
> with some lost, as described previously).
>
> I also saw producer connections go into flow control during the
> outage and remain there during the slow consumer delivery (though the
> test had long since completed delivering all its messages).
>
> Does this sound like expected behavior during a partition?
>
> Ray Maslinski Senior Software Developer, Engineering Valassis /
> Digital Media Cell: 585.330.2426 maslinskir at valassis.com
> www.valassis.com
>
> -----Original Message-----
> From: Simon MacMullen [mailto:simon at rabbitmq.com]
> Sent: Monday, May 20, 2013 6:30 AM
> To: Discussions about RabbitMQ
> Cc: Maslinski, Ray
> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and
> autoheal failures when recovering from cluster partition
>
> On 17/05/13 20:38, Maslinski, Ray wrote:
>> Hello,
>
> Hi!
>
>> To simulate a network partition failure, I've been using iptables
>> to temporarily block inbound and outbound access on one of the
>> nodes to the single port configured for cluster communications
>> through inet_dist_listen_min and inet_dist_listen_max settings (min
>> = max). Client access is not blocked during a simulated partition
>> fault.
>
> Sounds reasonable.
>
>> I've observed two anomalies during testing that I wasn't expecting
>> based on the documentation I've read:
>>
>> -At a sufficiently high message rate, some number of messages will
>> be lost during the fault sequence, with the number lost tending to
>> increase with message rate.  No indication of a send error has
>> been observed by the client program. Based on results obtained from
>> test logs and an independent monitor listening on trace messages
>> from each node, it appears that as soon as the port is blocked,
>> both nodes continue to accept published messages, but (temporarily)
>> stop delivering messages until the cluster heartbeat failure is
>> detected, at which point the cluster is partitioned and the slave
>> promotes itself to become master. In the sequences I've looked at,
>> the messages that are lost all appear to be published to the
>> original master (and final master after a winner is selected during
>> autoheal).  Neither the start nor the end of the lost message
>> window appears to line up with any events in the logs, other than
>> the start occurring sometime after the port connection is blocked
>> but before the cluster heartbeat failure is detected, and the end
>> occurring sometime after the detection of the cluster heartbeat
>> failure and before the detection of the partitioned cluster after
>> the connection is unblocked.  Is message loss to be expected in
>> this scenario?
>
> I would expect to see message loss in a cluster heal scenario.
>
> It's important to remember that a cluster partition is still a
> substantial problem, and the healing process involves throwing state
> away. Autoheal mode just means you get through this process faster,
> and hopefully spend much less time accepting messages that will end
> up being lost.
>
> I would expect intuitively that only messages from the losing
> partitions would be lost. But I am not entirely surprised if messages
> from the winner are lost too; there is a period after the partitions
> have come back together but before autoheal kicks in during which we
> will have multiple masters for a queue, and behaviour can be
> unpredictable.
>
>> -Occasionally the autoheal loser node fails to rejoin the cluster
>> after restart.  I don't have a lot of data points on this one
>> since it's only happened a handful of times during overnight test
>> iterations.  During one failure, the autoheal winner showed the
>> log message below during recovery:
>
> Ah, that looks like a bug in autoheal. I think the stack trace you
> posted should contain enough information to fix it. Thanks.
>
> Cheers, Simon
>
> -- Simon MacMullen RabbitMQ, Pivotal
>

-- 
Simon MacMullen
RabbitMQ, Pivotal