[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Simon MacMullen simon at rabbitmq.com
Mon Jun 3 11:36:34 BST 2013


Just to be clear - at what point in this test are you unblocking the 
connection so the partition can heal?

Cheers, Simon

On 31/05/13 20:04, Maslinski, Ray wrote:
> The net_ticktime is using the default value, and the event sequence
> seems consistent with the associated timeout.
>
> Roughly,
>
> Start test
>
> Block cluster connection at some point -> trace data shows that only
> publish events are occurring after this happens
>
> After a bit more than a minute, the RabbitMQ server log shows a lost
> cluster connection on both nodes, along with the death of the mirrors
> and the promotion of the slave to master.
>
> Test continues and completes the publish phase, which appeared to
> progress normally, with no errors visible to the client
>
> Test then starts to verify that all messages were received by polling
> the contents of a buffer used to store incoming messages.  In a
> normal scenario without faults, the first poll generally finds all
> the expected messages and the test completes successfully.
> Otherwise, the poll rechecks for any missing messages once per
> second, and gives up if 30 seconds elapse without any new messages
> showing up (the timer is reset if any messages arrive during the
> polling window).
>
> The slow trickle of messages occurs during the last phase: one
> appears to arrive every 14 seconds while the polling loop continues
> to retry.  Trace data doesn't appear to log any deliver events during
> this stage, but the console shows a slow increment in the number of
> unacknowledged messages that lines up with the delivery rate observed
> by the client.
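
The resettable-timeout polling loop described above can be sketched roughly as follows. This is a hypothetical stdlib-only sketch, not the actual test code; `fetch_new_messages` stands in for whatever reads the incoming-message buffer:

```python
import time

def wait_for_messages(expected, fetch_new_messages,
                      timeout=30.0, poll_interval=1.0):
    """Poll until all `expected` message ids arrive.  The timeout is
    reset whenever new messages show up; we give up after `timeout`
    quiet seconds.  Returns the set of messages still missing."""
    received = set()
    deadline = time.monotonic() + timeout
    while received != expected:
        new = fetch_new_messages() & (expected - received)
        if new:
            received |= new
            deadline = time.monotonic() + timeout  # progress: reset timer
        elif time.monotonic() >= deadline:
            break  # gave up: nothing new within the window
        time.sleep(poll_interval)
    return expected - received  # empty on success
```

With a partition in progress, the 1-message-per-14-seconds trickle keeps resetting the timer, which is why the test ran far longer than the nominal 30-second window.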
>
> Ray Maslinski
> Senior Software Developer, Engineering
> Valassis / Digital Media
> Cell: 585.330.2426
> maslinskir at valassis.com
> www.valassis.com
>
> Creating the future of intelligent media delivery to drive your
> greatest success
>
> _____________________________________________________________________________
>
>  This message may include proprietary or protected information. If
> you are not the intended recipient, please notify me, delete this
> message and do not further communicate the information contained
> herein without my express consent.
>
>
> -----Original Message-----
> From: Simon MacMullen [mailto:simon at rabbitmq.com]
> Sent: Friday, May 31, 2013 12:33 PM
> To: Maslinski, Ray
> Cc: Discussions about RabbitMQ
> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>
> Hi. The behaviour you describe doesn't really match what I would
> expect to see. In the beginning of a partition I would expect:
>
> 1) Partition starts
>
> 2) Things behave slowly for approximately net_ticktime (see
> http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes
> attempt to contact each other and time out
>
> 3) On each node, Erlang decides the other nodes are down. Things
> speed up again, HA queues fail over. Split-brain has begun.
>
> It sounds like you were stuck in 2) for an extended period of time.
> Have you changed net_ticktime?
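
For reference, net_ticktime is an Erlang kernel setting. If one did want to change it, the classic rabbitmq.config (Erlang-term) form would look something like this; 60 seconds is the default, which puts partition detection at roughly 45-75 seconds:

```erlang
%% rabbitmq.config -- illustrative fragment only
[
  {kernel, [
    {net_ticktime, 60}   %% seconds; 60 is the default
  ]}
].
```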
>
> Alternatively, when testing partitions I have seen really odd
> behaviour when blocking network traffic in one direction only with
> iptables. You might want to check whether you've done that by mistake.
>
> Cheers, Simon
>
> On 30/05/13 22:14, Maslinski, Ray wrote:
>> Follow-up question ...
>>
>> I tried some experiments to gain some understanding of how the
>> cluster behaved with clients attached during a network partition
>> event. Essentially, I repeated the previous tests described below
>> for autohealing and automatic queue synchronization, but left the
>> cluster communications port blocked while the client test
>> completed. One oddity I noticed was that while the consumer
>> connected to the slave appeared to receive an indication that
>> something was amiss (client log showed a consumer cancel exception
>> being handled by the Spring AMQP framework, and other monitoring
>> logs appeared to show the client restarting a connection, which
>> seems to be consistent with documentation), the consumer connected
>> to the master seemed to remain oblivious to any possible issues.
>> That consumer continued to receive messages, but at an extremely
>> slow rate (test published at 16/sec fixed rate, but the remaining
>> consumer began to receive messages at the rate of about 1 every 14
>> seconds).
>>
>> Since the test client waits around for expected message deliveries
>> with a resettable 30 second timeout, it continued to run for an
>> extended period of time (longer than I waited around for).  In
>> addition, the admin console showed a relatively small number of
>> unacked messages on that server, with the unacked count increasing
>> with each actual delivery (client should always be acknowledging
>> in the test setup, and reported no errors).  Eventually unblocking
>> the cluster port released a bunch of messages in a short interval
>> (albeit with some lost, as described previously).
>>
>> I also saw producer connections go into flow control during the
>> outage and remain there during the slow consumer delivery (though
>> the test had long since completed delivering all its messages).
>>
>> Does this sound like expected behavior during a partition?
>>
>>
>>
>> -----Original Message-----
>> From: Simon MacMullen [mailto:simon at rabbitmq.com]
>> Sent: Monday, May 20, 2013 6:30 AM
>> To: Discussions about RabbitMQ
>> Cc: Maslinski, Ray
>> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>>
>> On 17/05/13 20:38, Maslinski, Ray wrote:
>>> Hello,
>>
>> Hi!
>>
>>> To simulate a network partition failure, I've been using iptables
>>> to temporarily block inbound and outbound access on one of the
>>> nodes to the single port configured for cluster communications
>>> through inet_dist_listen_min and inet_dist_listen_max settings
>>> (min = max). Client access is not blocked during a simulated
>>> partition fault.
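
For concreteness, the blocking step described above could be done with rules along these lines. The port number 25672 is purely illustrative (use whatever inet_dist_listen_min/max is actually set to), and dropping both --dport and --sport traffic in both chains avoids accidentally blocking only one direction:

```
# Simulate the partition: drop all distribution-port traffic in BOTH
# directions (25672 is an example; match your configured port).
iptables -A INPUT  -p tcp --dport 25672 -j DROP
iptables -A INPUT  -p tcp --sport 25672 -j DROP
iptables -A OUTPUT -p tcp --dport 25672 -j DROP
iptables -A OUTPUT -p tcp --sport 25672 -j DROP

# Heal the partition: delete the same rules.
iptables -D INPUT  -p tcp --dport 25672 -j DROP
iptables -D INPUT  -p tcp --sport 25672 -j DROP
iptables -D OUTPUT -p tcp --dport 25672 -j DROP
iptables -D OUTPUT -p tcp --sport 25672 -j DROP
```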
>>
>> Sounds reasonable.
>>
>>> I've observed two anomalies during testing that I wasn't
>>> expecting based on the documentation I've read:
>>>
>>> -At a sufficiently high message rate, some number of messages
>>> will be lost during the fault sequence, with the number lost
>>> tending to increase with message rate.  No indication of a send
>>> error has been observed by the client program. Based on results
>>> from test logs and an independent monitor listening on trace
>>> messages from each node, it appears that as soon as the port is
>>> blocked, both nodes continue to accept published messages but
>>> (temporarily) stop delivering them until the cluster heartbeat
>>> failure is detected, at which point the cluster is partitioned
>>> and the slave promotes itself to master. In the sequences I've
>>> looked at, the messages that are lost all appear to be published
>>> to the original master (which is also the final master after a
>>> winner is selected during autoheal). Neither the start nor the
>>> end of the lost message window lines up with any events in the
>>> logs: the start occurs sometime after the port is blocked but
>>> before the cluster heartbeat failure is detected, and the end
>>> occurs sometime after that failure is detected but before the
>>> partition is detected once the connection is unblocked.  Is
>>> message loss to be expected in this scenario?
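
Since the client saw no send errors, one way to make this kind of silent loss visible at publish time is publisher confirms. The following is a minimal, hypothetical sketch of the bookkeeping a confirm-aware publisher would keep; the broker side is not shown, and `on_ack` would in practice be driven by the real confirm callback (RabbitMQ acks may be cumulative, signalled by the `multiple` flag):

```python
class ConfirmTracker:
    """Track published messages until the broker confirms them.
    Anything left in `outstanding` after a timeout is a loss candidate."""

    def __init__(self):
        self.next_seq = 1
        self.outstanding = {}  # sequence number -> payload

    def on_publish(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.outstanding[seq] = payload
        return seq

    def on_ack(self, seq, multiple=False):
        if multiple:
            # Cumulative ack: everything up to and including `seq`.
            for s in [s for s in self.outstanding if s <= seq]:
                del self.outstanding[s]
        else:
            self.outstanding.pop(seq, None)

    def unconfirmed(self):
        return dict(self.outstanding)
```

A publisher that waits for its outstanding set to drain (with a timeout) would notice during the partition that confirms stop arriving, instead of discovering the loss only at the verification phase.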
>>
>> I would expect to see message loss in a cluster heal scenario.
>>
>> It's important to remember that a cluster partition is still a
>> substantial problem, and the healing process involves throwing
>> state away. Autoheal mode just means you get through this process
>> faster, and hopefully spend much less time accepting messages that
>> will end up being lost.
>>
>> I would expect intuitively that only messages from the losing
>> partitions would be lost. But I am not entirely surprised if
>> messages from the winner are lost too; there is a period after the
>> partitions have come back together but before autoheal kicks in
>> during which we will have multiple masters for a queue, and
>> behaviour can be unpredictable.
>>
>>> -Occasionally the autoheal loser node fails to rejoin the
>>> cluster after restart.  I don't have a lot of data points on this
>>> one since it's only happened a handful of times during overnight
>>> test iterations.  During one failure, the autoheal winner showed
>>> the log message below during recovery:
>>
>> Ah, that looks like a bug in autoheal. I think the stack trace you
>> posted should contain enough information to fix it. Thanks.
>>
>> Cheers, Simon
>>
>> -- Simon MacMullen RabbitMQ, Pivotal
>>
>
>
> -- Simon MacMullen RabbitMQ, Pivotal
>


-- 
Simon MacMullen
RabbitMQ, Pivotal

