[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Mon Jun 3 16:07:43 BST 2013

Hmm. While the partition is in effect neither side should be aware that 
the other side is up at all - we assume if we lose contact with a node 
that it has crashed. So I'm quite surprised by the slow trickle you are 
seeing, since I'm sure people don't see that if part of a cluster just 
crashes.

So I am more suspicious that the iptables commands you are using are in 
fact not simulating a real partitioning of the cluster.

Can you post the iptables commands you are using and I will try to 
replicate?

Cheers, Simon

On 03/06/13 14:01, Maslinski, Ray wrote:
> For this particular test, I was trying to gain some insight as to the
> behavior of the cluster while the partition was in effect, so the
> partition wasn't resolved until the test was completed or aborted.
> My general expectation was that both sides of the partition would
> begin to process messages more or less normally once the cluster
> connection timed out and the slave promoted itself to master,
> possibly with some glitches such as dropped messages during the
> transition.  The slow trickle of messages delivered on one of the
> existing connections was unexpected, and seemed to be a bit of an
> anomaly when combined with the growing count of unacked messages
> shown on the console.
>
> Ray Maslinski Senior Software Developer, Engineering Valassis /
> Digital Media Cell: 585.330.2426 maslinskir at valassis.com
> www.valassis.com
>
> Creating the future of intelligent media delivery to drive your
> greatest success
>
> _____________________________________________________________________________
>
>  This message may include proprietary or protected information. If
> you are not the intended recipient, please notify me, delete this
> message and do not further communicate the information contained
> herein without my express consent.
>
> -----Original Message----- From: Simon MacMullen
> [mailto:simon at rabbitmq.com] Sent: Monday, June 03, 2013 6:37 AM To:
> Maslinski, Ray Cc: Discussions about RabbitMQ Subject: Re:
> [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures
> when recovering from cluster partition
>
> Just to be clear - at what point in this test are you unblocking the
> connection so the partition can heal?
>
> Cheers, Simon
>
> On 31/05/13 20:04, Maslinski, Ray wrote:
>> The net_ticktime is using the default value, and event sequence
>> seems consistent with the associated timeout.
>>
>> Roughly,
>>
>> Start test
>>
>> Block cluster connection at some point -> trace data shows that
>> only publish events are occurring after this happens
>>
>> After a bit more than a minute, RabbitMQ server log shows lost
>> cluster connection on both nodes, along with the death of the
>> mirrors and promotion of slave to master.
>>
>> Test continues and completes publish phase, since these appeared to
>> be progressing normally, with no errors visible to client
>>
>> Test then starts to verify that all messages were received by
>> polling the contents of a buffer used to store incoming messages.
>> In a normal scenario without faults, this generally means that the
>> first poll finds all the messages expected to be present and test
>> completes successfully.  Otherwise, the poll rechecks for any
>> missing messages once per second, and gives up if 30 seconds elapse
>> without any new messages showing up (timer is reset if any messages
>> arrived during polling window).
>>
>> The slow trickle of messages occurs during the last phase, as one
>> appears to arrive every 14 seconds and the polling loop continues
>> to retry.  Trace data doesn't appear to log any deliver events
>> during this stage, but console shows a slow increment in the number
>> of unacknowledged messages that lines up with the delivery rate
>> observed by the client.
>>
>> Ray Maslinski Senior Software Developer, Engineering Valassis /
>> Digital Media Cell: 585.330.2426 maslinskir at valassis.com
>> www.valassis.com
>>
>> Creating the future of intelligent media delivery to drive your
>> greatest success
>>
>> ______________________________________________________________________
>>
>>
_______
>>
>> This message may include proprietary or protected information. If
>> you are not the intended recipient, please notify me, delete this
>> message and do not further communicate the information contained
>> herein without my express consent.
>>
>>
>> -----Original Message----- From: Simon MacMullen
>> [mailto:simon at rabbitmq.com] Sent: Friday, May 31, 2013 12:33 PM
>> To: Maslinski, Ray Cc: Discussions about RabbitMQ Subject: Re:
>> [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal
>> failures when recovering from cluster partition
>>
>> Hi. The behaviour you describe doesn't really match what I would
>> expect to see. In the beginning of a partition I would expect:
>>
>> 1) Partition starts
>>
>> 2) Things behave slowly for approximately net_ticktime (see
>> http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes
>> attempt to contact each other and time out
>>
>> 3) On each node, Erlang decides the other nodes are down. Things
>> speed up again, HA queues fail over. Split-brain has begun.
>>
>> It sounds like you were stuck in 2) for an extended period of
>> time. Have you changed net_ticktime?
>>
>> Alternatively, when testing partitions I have see really odd
>> behaviour by blocking network traffic in one direction only with
>> iptables. You might want to check if you've done that by mistake.
>>
>> Cheers, Simon
>>
>> On 30/05/13 22:14, Maslinski, Ray wrote:
>>> Follow-up question ...
>>>
>>> I tried some experiments to gain some understanding of how the
>>> cluster behaved with clients attached during a network partition
>>> event. Essentially, I repeated the previous tests described below
>>> for autohealing and automatic queue synchronization, but left the
>>> cluster communications port blocked while the client test
>>> completed. One oddity I noticed was that while the consumer
>>> connected to the slave appeared to receive an indication that
>>> something was amiss (client log showed a consumer cancel
>>> exception being handled by the Spring AMQP framework, and other
>>> monitoring logs appeared to show the client restarting a
>>> connection, which seems to be consistent with documentation), the
>>> consumer connected to the master seemed to remain oblivious to
>>> any possible issues. That consumer continued to receive messages,
>>> but at an extremely slow rate (test published at 16/sec fixed
>>> rate, but the remaining consumer began to receive messages at the
>>> rate of about 1 every 14 seconds).
>>>
>>> Since the test client waits around for expected message
>>> deliveries with a resettable 30 second timeout, it continued to
>>> run for an extended period of time (longer than I waited around
>>> for).  In addition, the admin console showed a relatively small
>>> number of unacked messages on that server, with the unacked count
>>> increasing with each actual delivery (client should always be
>>> acknowledging in the test setup, and reported no errors).
>>> Eventually unblocking the cluster port released a bunch of
>>> messages in a short interval (albeit with some lost, as described
>>> previously).
>>>
>>> I also saw  producer connections go into flow control during the
>>> outage and remain there during the slow consumer delivery (though
>>> the test had long since completed delivering all its messages).
>>>
>>> Does this sound like expected behavior during a partition?
>>>
>>> Ray Maslinski Senior Software Developer, Engineering Valassis /
>>> Digital Media Cell: 585.330.2426 maslinskir at valassis.com
>>> www.valassis.com
>>>
>>> Creating the future of intelligent media delivery to drive your
>>> greatest success
>>>
>>> _____________________________________________________________________
>>>
>>>
_
>>>
>>>
> _______
>>>
>>> This message may include proprietary or protected information. If
>>> you are not the intended recipient, please notify me, delete this
>>> message and do not further communicate the information contained
>>> herein without my express consent.
>>>
>>> -----Original Message----- From: Simon MacMullen
>>> [mailto:simon at rabbitmq.com] Sent: Monday, May 20, 2013 6:30 AM
>>> To: Discussions about RabbitMQ Cc: Maslinski, Ray Subject: Re:
>>> [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal
>>> failures when recovering from cluster partition
>>>
>>> On 17/05/13 20:38, Maslinski, Ray wrote:
>>>> Hello,
>>>
>>> Hi!
>>>
>>>> To simulate a network partition failure, I've been using
>>>> iptables to temporarily block inbound and outbound access on
>>>> one of the nodes to the single port configured for cluster
>>>> communications through inet_dist_listen_min and
>>>> inet_dist_listen_max settings (min = max). Client access is not
>>>> blocked during a simulated partition fault.
>>>
>>> Sounds reasonable.
>>>
>>>> I've observed two anomalies during testing that I wasn't
>>>> expecting based on the documentation I've read:
>>>>
>>>> -At a sufficiently high message rate, some number of messages
>>>> will be lost during the fault sequence, with the number lost
>>>> tending to increase with message rate.  No indication of a send
>>>> error has been observed by the client program. Based on results
>>>> obtained from test logs and an independent monitor listening on
>>>> trace messages from each node, it appears that as soon as the
>>>> port is blocked, both nodes continue to accept published
>>>> messages, but (temporarily) stop delivering messages until the
>>>> cluster heartbeat failure is detected, at which point the
>>>> cluster is partitioned and the slave promotes itself to become
>>>> master. In the sequences I've looked at, the messages that are
>>>> lost all appear to be published to the original master (and
>>>> final master after a winner is selected during autoheal).
>>>> Neither the start nor the end of the lost message window appear
>>>> to line up with any events in the logs, other than the start
>>>> occurring sometime after the port connection is blocked but
>>>> before the cluster heartbeat failure is detected, and the end
>>>> occurring sometime after the detection of the cluster heartbeat
>>>> failure and before the detection of the partitioned cluster
>>>> after the connection is unblocked.  Is message loss to be
>>>> expected in this scenario?
>>>
>>> I would expect to see message loss in a cluster heal scenario.
>>>
>>> It's important to remember that a cluster partition is still a
>>> substantial problem, and the healing process involves throwing
>>> state away. Autoheal mode just means you get through this process
>>> faster, and hopefully spend much less time accepting messages
>>> that will end up being lost.
>>>
>>> I would expect intuitively that only messages from the losing
>>> partitions would be lost. But I am not entirely surprised if
>>> messages from the winner are lost too; there is a period after
>>> the partitions have come back together but before autoheal kicks
>>> in during which we will have multiple masters for a queue, and
>>> behaviour can be unpredictable.
>>>
>>>> -Occasionally the autoheal loser node fails to rejoin the
>>>> cluster after restart.  I don't have a lot of data points on
>>>> this one since it's only happened a handful of times during
>>>> overnight test iterations.  During one failure, the autoheal
>>>> winner showed the log message below during recovery:
>>>
>>> Ah, that looks like a bug in autoheal. I think the stack trace
>>> you posted should contain enough information to fix it. Thanks.
>>>
>>> Cheers, Simon
>>>
>>> -- Simon MacMullen RabbitMQ, Pivotal
>>>
>>
>>
>> -- Simon MacMullen RabbitMQ, Pivotal
>>
>
>
> -- Simon MacMullen RabbitMQ, Pivotal
>

-- 
Simon MacMullen
RabbitMQ, Pivotal