[rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition

Simon MacMullen simon at rabbitmq.com
Tue Jun 4 16:16:43 BST 2013


Yeah, that's roughly what I'm doing, and I don't see this trickle effect 
- once each side of the cluster decides the other one has gone down, we 
see things running at full speed.

I assume you are seeing node-down messages in the logs on both sides, 
and rabbitmqctl / the management plugin showing the other node as down 
in each case?
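
For example, while the partition is in effect, something like

        rabbitmqctl cluster_status

run on each node should still list the other node under {nodes,...} but 
not under {running_nodes,...}.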

Cheers, Simon

On 03/06/13 16:17, Maslinski, Ray wrote:
> I've tried a few variants without noticing any apparent differences in behavior; here's what I currently have:
>
> 		sudo iptables -A INPUT -i eth0 -p tcp -s $address -j DROP
> 		sudo iptables -A OUTPUT -o eth0 -p tcp -d $address -j DROP
>
> where $address is the ip address of the other node in the mirror.
>
> The other variants included adding a specific port so that only traffic on the cluster port between the two nodes is dropped (adding --dport $port to the commands above), and dropping all traffic on the cluster port regardless of IP address (adding the --dport option and removing the -s/-d options).
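>
> Concretely, the port-specific variant looked roughly like this:
>
> 		sudo iptables -A INPUT -i eth0 -p tcp -s $address --dport $port -j DROP
> 		sudo iptables -A OUTPUT -o eth0 -p tcp -d $address --dport $port -j DROP
>
> where $port is the port configured through inet_dist_listen_min/max.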
>
> Ray Maslinski
> Senior Software Developer, Engineering
> Valassis / Digital Media
> Cell: 585.330.2426
> maslinskir at valassis.com
> www.valassis.com
>
> -----Original Message-----
> From: Simon MacMullen [mailto:simon at rabbitmq.com]
> Sent: Monday, June 03, 2013 11:08 AM
> To: Maslinski, Ray
> Cc: Discussions about RabbitMQ
> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>
> Hmm. While the partition is in effect, neither side should be aware that the other side is up at all - we assume that if we lose contact with a node, it has crashed. So I'm quite surprised by the slow trickle you are seeing, since I'm sure people don't see that if part of a cluster just crashes.
>
> So I am more suspicious that the iptables commands you are using are in fact not simulating a real partitioning of the cluster.
>
> Can you post the iptables commands you are using and I will try to replicate?
>
> Cheers, Simon
>
> On 03/06/13 14:01, Maslinski, Ray wrote:
>> For this particular test, I was trying to gain some insight as to the
>> behavior of the cluster while the partition was in effect, so the
>> partition wasn't resolved until the test was completed or aborted.
>> My general expectation was that both sides of the partition would
>> begin to process messages more or less normally once the cluster
>> connection timed out and the slave promoted itself to master, possibly
>> with some glitches such as dropped messages during the transition.
>> The slow trickle of messages delivered on one of the existing
>> connections was unexpected, and seemed to be a bit of an anomaly when
>> combined with the growing count of unacked messages shown on the
>> console.
>>
>> Ray Maslinski
>> Senior Software Developer, Engineering
>> Valassis / Digital Media
>> Cell: 585.330.2426
>> maslinskir at valassis.com
>> www.valassis.com
>>
>> -----Original Message-----
>> From: Simon MacMullen [mailto:simon at rabbitmq.com]
>> Sent: Monday, June 03, 2013 6:37 AM
>> To: Maslinski, Ray
>> Cc: Discussions about RabbitMQ
>> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>>
>> Just to be clear - at what point in this test are you unblocking the
>> connection so the partition can heal?
>>
>> Cheers, Simon
>>
>> On 31/05/13 20:04, Maslinski, Ray wrote:
>>> The net_ticktime is using the default value, and the event sequence
>>> seems consistent with the associated timeout.
>>>
>>> Roughly,
>>>
>>> Start test
>>>
>>> Block cluster connection at some point -> trace data shows that only
>>> publish events are occurring after this happens
>>>
>>> After a bit more than a minute, RabbitMQ server log shows lost
>>> cluster connection on both nodes, along with the death of the mirrors
>>> and promotion of slave to master.
>>>
>>> Test continues and completes the publish phase, since publishes
>>> appeared to be progressing normally, with no errors visible to the client
>>>
>>> Test then starts to verify that all messages were received by polling
>>> the contents of a buffer used to store incoming messages.
>>> In a normal scenario without faults, this generally means that the
>>> first poll finds all the messages expected to be present and test
>>> completes successfully.  Otherwise, the poll rechecks for any missing
>>> messages once per second, and gives up if 30 seconds elapse without
>>> any new messages showing up (timer is reset if any messages arrived
>>> during polling window).
>>>
>>> The slow trickle of messages occurs during the last phase, as one
>>> appears to arrive every 14 seconds and the polling loop continues to
>>> retry.  Trace data doesn't appear to log any deliver events during
>>> this stage, but console shows a slow increment in the number of
>>> unacknowledged messages that lines up with the delivery rate observed
>>> by the client.
>>>
>>> Ray Maslinski
>>> Senior Software Developer, Engineering
>>> Valassis / Digital Media
>>> Cell: 585.330.2426
>>> maslinskir at valassis.com
>>> www.valassis.com
>>>
>>> -----Original Message-----
>>> From: Simon MacMullen [mailto:simon at rabbitmq.com]
>>> Sent: Friday, May 31, 2013 12:33 PM
>>> To: Maslinski, Ray
>>> Cc: Discussions about RabbitMQ
>>> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>>>
>>> Hi. The behaviour you describe doesn't really match what I would
>>> expect to see. In the beginning of a partition I would expect:
>>>
>>> 1) Partition starts
>>>
>>> 2) Things behave slowly for approximately net_ticktime (see
>>> http://www.rabbitmq.com/partitions.html#net_ticktime) as nodes
>>> attempt to contact each other and time out
>>>
>>> 3) On each node, Erlang decides the other nodes are down. Things
>>> speed up again, HA queues fail over. Split-brain has begun.
>>>
>>> It sounds like you were stuck in 2) for an extended period of time.
>>> Have you changed net_ticktime?
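>>>
>>> Something like
>>>
>>> 		rabbitmqctl eval 'net_kernel:get_net_ticktime().'
>>>
>>> should show the value in effect on a running node - the default is 60,
>>> which makes down-node detection take roughly 45-75 seconds.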
>>>
>>> Alternatively, when testing partitions I have seen really odd
>>> behaviour caused by blocking network traffic in one direction only
>>> with iptables. You might want to check if you've done that by mistake.
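>>>
>>> An easy way to check is to list the rules with packet counters, e.g.
>>>
>>> 		sudo iptables -L -n -v
>>>
>>> and confirm that both the INPUT and OUTPUT rules are present and that
>>> both are actually matching packets while the test runs.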
>>>
>>> Cheers, Simon
>>>
>>> On 30/05/13 22:14, Maslinski, Ray wrote:
>>>> Follow-up question ...
>>>>
>>>> I tried some experiments to gain some understanding of how the
>>>> cluster behaved with clients attached during a network partition
>>>> event. Essentially, I repeated the previous tests described below
>>>> for autohealing and automatic queue synchronization, but left the
>>>> cluster communications port blocked while the client test completed.
>>>> One oddity I noticed was that while the consumer connected to the
>>>> slave appeared to receive an indication that something was amiss
>>>> (client log showed a consumer cancel exception being handled by the
>>>> Spring AMQP framework, and other monitoring logs appeared to show
>>>> the client restarting a connection, which seems to be consistent
>>>> with documentation), the consumer connected to the master seemed to
>>>> remain oblivious to any possible issues. That consumer continued to
>>>> receive messages, but at an extremely slow rate (test published at
>>>> 16/sec fixed rate, but the remaining consumer began to receive
>>>> messages at the rate of about 1 every 14 seconds).
>>>>
>>>> Since the test client waits around for expected message deliveries
>>>> with a resettable 30 second timeout, it continued to run for an
>>>> extended period of time (longer than I waited around for).  In
>>>> addition, the admin console showed a relatively small number of
>>>> unacked messages on that server, with the unacked count increasing
>>>> with each actual delivery (client should always be acknowledging in
>>>> the test setup, and reported no errors).
>>>> Eventually unblocking the cluster port released a bunch of messages
>>>> in a short interval (albeit with some lost, as described
>>>> previously).
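>>>>
>>>> (Unblocking here just means deleting the rules added earlier, roughly:
>>>>
>>>> 		sudo iptables -D INPUT -i eth0 -p tcp -s $address -j DROP
>>>> 		sudo iptables -D OUTPUT -o eth0 -p tcp -d $address -j DROP
>>>>
>>>> i.e. the same rule specifications as the -A commands, with -D instead.)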
>>>>
>>>> I also saw producer connections go into flow control during the
>>>> outage and remain there during the slow consumer delivery (though
>>>> the test had long since completed publishing all its messages).
>>>>
>>>> Does this sound like expected behavior during a partition?
>>>>
>>>> Ray Maslinski
>>>> Senior Software Developer, Engineering
>>>> Valassis / Digital Media
>>>> Cell: 585.330.2426
>>>> maslinskir at valassis.com
>>>> www.valassis.com
>>>>
>>>> -----Original Message-----
>>>> From: Simon MacMullen [mailto:simon at rabbitmq.com]
>>>> Sent: Monday, May 20, 2013 6:30 AM
>>>> To: Discussions about RabbitMQ
>>>> Cc: Maslinski, Ray
>>>> Subject: Re: [rabbitmq-discuss] RabbitMQ 3.1.0 lost messages and autoheal failures when recovering from cluster partition
>>>>
>>>> On 17/05/13 20:38, Maslinski, Ray wrote:
>>>>> Hello,
>>>>
>>>> Hi!
>>>>
>>>>> To simulate a network partition failure, I've been using iptables
>>>>> to temporarily block inbound and outbound access on one of the
>>>>> nodes to the single port configured for cluster communications
>>>>> through inet_dist_listen_min and inet_dist_listen_max settings (min
>>>>> = max). Client access is not blocked during a simulated partition
>>>>> fault.
>>>>
>>>> Sounds reasonable.
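>>>>
>>>> As a sanity check,
>>>>
>>>> 		epmd -names
>>>>
>>>> on each machine will show the port each node is actually listening on
>>>> for distribution traffic, which should match your
>>>> inet_dist_listen_min/max setting.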
>>>>
>>>>> I've observed two anomalies during testing that I wasn't expecting
>>>>> based on the documentation I've read:
>>>>>
>>>>> -At a sufficiently high message rate, some number of messages will
>>>>> be lost during the fault sequence, with the number lost tending to
>>>>> increase with message rate.  No indication of a send error has been
>>>>> observed by the client program. Based on results obtained from test
>>>>> logs and an independent monitor listening on trace messages from
>>>>> each node, it appears that as soon as the port is blocked, both
>>>>> nodes continue to accept published messages, but (temporarily) stop
>>>>> delivering messages until the cluster heartbeat failure is
>>>>> detected, at which point the cluster is partitioned and the slave
>>>>> promotes itself to become master. In the sequences I've looked at,
>>>>> the messages that are lost all appear to be published to the
>>>>> original master (and final master after a winner is selected during
>>>>> autoheal).
>>>>> Neither the start nor the end of the lost message window appears to
>>>>> line up with any events in the logs, other than the start occurring
>>>>> sometime after the port connection is blocked but before the
>>>>> cluster heartbeat failure is detected, and the end occurring
>>>>> sometime after the detection of the cluster heartbeat failure and
>>>>> before the detection of the partitioned cluster after the
>>>>> connection is unblocked.  Is message loss to be expected in this
>>>>> scenario?
>>>>
>>>> I would expect to see message loss in a cluster heal scenario.
>>>>
>>>> It's important to remember that a cluster partition is still a
>>>> substantial problem, and the healing process involves throwing state
>>>> away. Autoheal mode just means you get through this process faster,
>>>> and hopefully spend much less time accepting messages that will end
>>>> up being lost.
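>>>>
>>>> (If you want to confirm the mode in effect, something like
>>>>
>>>> 		rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'
>>>>
>>>> should return {ok,autoheal} in your setup.)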
>>>>
>>>> I would expect intuitively that only messages from the losing
>>>> partitions would be lost. But I am not entirely surprised if
>>>> messages from the winner are lost too; there is a period after the
>>>> partitions have come back together but before autoheal kicks in
>>>> during which we will have multiple masters for a queue, and
>>>> behaviour can be unpredictable.
>>>>
>>>>> -Occasionally the autoheal loser node fails to rejoin the cluster
>>>>> after restart.  I don't have a lot of data points on this one since
>>>>> it's only happened a handful of times during overnight test
>>>>> iterations.  During one failure, the autoheal winner showed the log
>>>>> message below during recovery:
>>>>
>>>> Ah, that looks like a bug in autoheal. I think the stack trace you
>>>> posted should contain enough information to fix it. Thanks.
>>>>
>>>> Cheers, Simon
>>>>
>>>> --
>>>> Simon MacMullen
>>>> RabbitMQ, Pivotal
>>>>
>>>
>>>
>>> --
>>> Simon MacMullen
>>> RabbitMQ, Pivotal
>>>
>>
>>
>> --
>> Simon MacMullen
>> RabbitMQ, Pivotal
>>
>
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>


-- 
Simon MacMullen
RabbitMQ, Pivotal

