[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
simon at rabbitmq.com
Fri May 31 17:32:57 BST 2013
Hi. The behaviour you describe doesn't really match what I would expect
to see. In the beginning of a partition I would expect:
1) Partition starts
2) Things behave slowly for approximately net_ticktime (see
http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes attempt
to contact each other and time out
3) On each node, Erlang decides the other nodes are down. Things speed
up again, HA queues fail over. Split-brain has begun.
It sounds like you were stuck in 2) for an extended period of time. Have
you changed net_ticktime?
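
As a rough guide to how long step 2) should last, here is a minimal sketch in Python. It assumes the behaviour described in the Erlang kernel documentation: ticks are sent every net_ticktime/4 seconds, so an unresponsive node is declared down somewhere between 0.75 and 1.25 times net_ticktime (60 seconds by default). The function name detection_window is made up for this illustration and is not part of RabbitMQ or Erlang.

```python
from typing import Tuple


def detection_window(net_ticktime: float = 60.0) -> Tuple[float, float]:
    """Return (earliest, latest) seconds before a silent peer is declared down.

    Assumption: the Erlang kernel detects an unresponsive node within
    net_ticktime +/- net_ticktime / 4 seconds.
    """
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


if __name__ == "__main__":
    earliest, latest = detection_window()
    print(f"Expect the slow phase to last roughly {earliest:.0f} to {latest:.0f} seconds")
```

If the slow phase lasts far longer than the upper bound for your configured net_ticktime, the nodes are probably not seeing a clean partition.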
Alternatively, when testing partitions I have seen really odd behaviour
by blocking network traffic in one direction only with iptables. You
might want to check if you've done that by mistake.
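
One way to guard against a one-directional block is to generate the iptables rules in matched pairs. The sketch below, in Python, only builds the command lines and does not run them. It assumes a single distribution port (inet_dist_listen_min = inet_dist_listen_max); the port number 25672 and the helper names are hypothetical, so substitute whatever your nodes are configured with.

```python
from typing import List

# Hypothetical port; use the value of inet_dist_listen_min/max on your nodes.
DIST_PORT = 25672


def partition_rules(dist_port: int, action: str = "-A") -> List[List[str]]:
    """Build iptables commands that drop distribution traffic both ways.

    Matching on both --dport and --sport in both chains covers connections
    this node accepted as well as connections it initiated.
    Pass action="-D" to build the commands that delete the same rules.
    """
    rules = []
    for chain in ("INPUT", "OUTPUT"):
        for port_flag in ("--dport", "--sport"):
            rules.append(["iptables", action, chain, "-p", "tcp",
                          port_flag, str(dist_port), "-j", "DROP"])
    return rules


def is_symmetric(rules: List[List[str]]) -> bool:
    """Return True only if the rule set touches both INPUT and OUTPUT."""
    return {rule[2] for rule in rules} == {"INPUT", "OUTPUT"}


if __name__ == "__main__":
    for rule in partition_rules(DIST_PORT):
        print(" ".join(rule))
```

Printing the commands first and checking is_symmetric before applying them makes it harder to end up with traffic blocked in one direction only.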
On 30/05/13 22:14, Maslinski, Ray wrote:
> Follow-up question ...
> I tried some experiments to gain some understanding of how the
> cluster behaved with clients attached during a network partition
> event. Essentially, I repeated the previous tests described below
> for autohealing and automatic queue synchronization, but left the
> cluster communications port blocked while the client test completed.
> One oddity I noticed was that while the consumer connected to the
> slave appeared to receive an indication that something was amiss
> (client log showed a consumer cancel exception being handled by the
> Spring AMQP framework, and other monitoring logs appeared to show the
> client restarting a connection, which seems to be consistent with
> documentation), the consumer connected to the master seemed to remain
> oblivious to any possible issues. That consumer continued to receive
> messages, but at an extremely slow rate (test published at 16/sec
> fixed rate, but the remaining consumer began to receive messages at
> the rate of about 1 every 14 seconds).
> Since the test client waits around for expected message deliveries
> with a resettable 30 second timeout, it continued to run for an
> extended period of time (longer than I waited around for). In
> addition, the admin console showed a relatively small number of
> unacked messages on that server, with the unacked count increasing
> with each actual delivery (client should always be acknowledging in
> the test setup, and reported no errors). Eventually unblocking the
> cluster port released a bunch of messages in a short interval (albeit
> with some lost, as described previously).
> I also saw producer connections go into flow control during the
> outage and remain there during the slow consumer delivery (though the
> test had long since completed delivering all its messages).
> Does this sound like expected behavior during a partition?
> Ray Maslinski Senior Software Developer, Engineering Valassis /
> Digital Media Cell: 585.330.2426 maslinskir at valassis.com
> -----Original Message-----
> From: Simon MacMullen [mailto:simon at rabbitmq.com]
> Sent: Monday, May 20, 2013 6:30 AM
> To: Discussions about RabbitMQ
> Cc: Maslinski, Ray
> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal
> failures when recovering from cluster partition
> On 17/05/13 20:38, Maslinski, Ray wrote:
>> To simulate a network partition failure, I've been using iptables
>> to temporarily block inbound and outbound access on one of the
>> nodes to the single port configured for cluster communications
>> through inet_dist_listen_min and inet_dist_listen_max settings (min
>> = max). Client access is not blocked during a simulated partition
> Sounds reasonable.
>> I've observed two anomalies during testing that I wasn't expecting
>> based on the documentation I've read:
>> -At a sufficiently high message rate, some number of messages will
>> be lost during the fault sequence, with the number lost tending to
>> increase with message rate. No indication of a send error has
>> been observed by the client program. Based on results obtained from
>> test logs and an independent monitor listening on trace messages
>> from each node, it appears that as soon as the port is blocked,
>> both nodes continue to accept published messages, but (temporarily)
>> stop delivering messages until the cluster heartbeat failure is
>> detected, at which point the cluster is partitioned and the slave
>> promotes itself to become master. In the sequences I've looked at,
>> the messages that are lost all appear to be published to the
>> original master (and final master after a winner is selected during
>> autoheal). Neither the start nor the end of the lost message
>> window appears to line up with any events in the logs, other than
>> the start occurring sometime after the port connection is blocked
>> but before the cluster heartbeat failure is detected, and the end
>> occurring sometime after the detection of the cluster heartbeat
>> failure and before the detection of the partitioned cluster after
>> the connection is unblocked. Is message loss to be expected in
>> this scenario?
> I would expect to see message loss in a cluster heal scenario.
> It's important to remember that a cluster partition is still a
> substantial problem, and the healing process involves throwing state
> away. Autoheal mode just means you get through this process faster,
> and hopefully spend much less time accepting messages that will end
> up being lost.
> I would expect intuitively that only messages from the losing
> partitions would be lost. But I am not entirely surprised if messages
> from the winner are lost too; there is a period after the partitions
> have come back together but before autoheal kicks in during which we
> will have multiple masters for a queue, and behaviour can be
>> -Occasionally the autoheal loser node fails to rejoin the cluster
>> after restart. I don't have a lot of data points on this one
>> since it's only happened a handful of times during overnight test
>> iterations. During one failure, the autoheal winner showed the
>> log message below during recovery:
> Ah, that looks like a bug in autoheal. I think the stack trace you
> posted should contain enough information to fix it. Thanks.
> Cheers, Simon
> -- Simon MacMullen RabbitMQ, Pivotal