[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Wed Apr 17 16:33:01 BST 2013

We run a combination of clustered (across zones) and shovel-connected
nodes in several regions, mostly m1 smalls but bigger ones too.

We've experienced one partition in the last few months, a 2/1 split. As I
recall, partitions occur when an OTP heartbeat cannot be processed within
the time frame set by the net_ticktime OTP parameter. In EC2 that can
happen for several reasons, most commonly because the shared network
resource is being dominated by something else. In our case we discovered
that, due to an oversight, all three instances were simultaneously engaged
in high IO activity unrelated to rabbitmq. One likely way to trigger this
state is to snapshot the EBS disks attached to the instances at the same
time, especially if the disks are large.
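
To put rough numbers on that time frame, here is a minimal Python sketch.
It assumes the behaviour described in the Erlang kernel documentation: an
unresponsive node is detected somewhere between net_ticktime minus a
quarter and net_ticktime plus a quarter, with a default of 60 seconds.
The function names are made up for illustration and are not part of
RabbitMQ or OTP.

```python
# Sketch only: estimates how long a node must be stalled before its peers
# may declare it down. Assumes the documented net_ticktime behaviour
# (detection between TickTime - TickTime/4 and TickTime + TickTime/4).

DEFAULT_NET_TICKTIME_S = 60  # kernel default, in seconds


def detection_window(net_ticktime_s=DEFAULT_NET_TICKTIME_S):
    """Return (min_s, max_s): the window in which a silent peer is declared down."""
    quarter = net_ticktime_s / 4
    return net_ticktime_s - quarter, net_ticktime_s + quarter


def may_partition(stall_s, net_ticktime_s=DEFAULT_NET_TICKTIME_S):
    """True if a stall of stall_s seconds is long enough to risk a partition."""
    min_s, _ = detection_window(net_ticktime_s)
    return stall_s > min_s


if __name__ == "__main__":
    low, high = detection_window()
    print(f"peer declared down after {low:.0f}s to {high:.0f}s of silence")
    for stall in (10, 50, 80):
        print(f"{stall}s stall -> partition risk: {may_partition(stall)}")
```

With the default, an IO or network stall shorter than about 45 seconds
should ride through, which is why a long simultaneous stall on all three
instances is the dangerous case.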

To mitigate this you can:

1) use larger instances with more IO capability (bigger hose), more
   cores, and more memory (more schlitz) - we may do this;
2) cluster with fewer instances so there is less 'internal work' going
   on - we cluster with no more than 3, then scale up as needed (at the
   core - the retail layer scales out using shovels);
3) architecturally lighten the load by: avoiding persistent messages,
   never using transactions, offloading large message bodies and passing
   references instead (we use S3, cassandra, DynamoDB), emphasizing
   federation/shovels for broad deployments, tolerating duplicate
   messages, keeping queues short, etc.
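
As an illustration of two of the ideas in 3), passing references instead
of large bodies and tolerating duplicates, here is a minimal Python
sketch. Everything in it is hypothetical: BlobStore is an in-memory
stand-in for an external store such as S3, and the message layout and
class names are invented for the example. It is not our production code
and it does not talk to a broker.

```python
import hashlib
import json


class BlobStore:
    """In-memory stand-in for an external store such as S3 (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, body: bytes) -> str:
        # Content-addressed key: storing the same body twice is harmless.
        key = hashlib.sha256(body).hexdigest()
        self._blobs[key] = body
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def make_message(store: BlobStore, message_id: str, body: bytes) -> str:
    """Store the large body externally and return a small message carrying a reference."""
    return json.dumps({"id": message_id, "body_ref": store.put(body)})


class Consumer:
    """Processes each message id at most once, so redelivery is harmless."""

    def __init__(self, store: BlobStore):
        self._store = store
        self._seen = set()  # unbounded here; a real system would expire old ids
        self.processed = []

    def handle(self, raw: str) -> bool:
        msg = json.loads(raw)
        if msg["id"] in self._seen:
            return False  # duplicate delivery: ignore it
        self._seen.add(msg["id"])
        self.processed.append(self._store.get(msg["body_ref"]))
        return True


if __name__ == "__main__":
    store = BlobStore()
    consumer = Consumer(store)
    message = make_message(store, "m-1", b"x" * 1_000_000)
    print(len(message), "bytes on the wire")
    print(consumer.handle(message))  # first delivery is processed
    print(consumer.handle(message))  # redelivery is skipped
```

The point is that the broker only ever carries the small JSON message,
and a consumer that remembers message ids can shrug off the duplicates
that shovels or a healed partition may produce.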

In your case, running celery, I would suggest: run 2 instances; make them
m1 large or xlarge; use us-west-2 instead of us-west-1 if you can (cheaper,
newer, bigger); if possible with celery and your app designs, avoid
persistent messages - and do the other stuff in 3) above.

Michael Laing
NYTimes

On 4/16/13 5:58 PM, "Matt Wise" <matt at nextdoor.com> wrote:

>Interesting... We are still running one node in us-west-1a, and one in
>us-west-1c. Today alone we saw 3 network glitches on the node in
>us-west-1a where it became unable to connect to remote services in other
>datacenters. Obviously the hardware or rack that machine is living on is
>having problems.
>
>The interesting part though is that RabbitMQ did not seem to report a
>network partition during these events now that we're running 2.8.5 and
>only two nodes instead of three. I'm still digging through the logs to
>see if there were other interruptions, but it still feels like the code
>is either:
>  a) more stable with 2 nodes
>  b) more fault-tolerant in 2.8.5 than it is in 3.0.4
>
>--Matt
>
>
>On Apr 15, 2013, at 2:38 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
>
>> To add to what Emile said: the only difference between partition
>> handling in 2.x and 3.x is that 3.x will show a big red warning in
>> management when one has occurred whereas 2.x will stay silent. If you
>> still have logs from the 2.x days you might want to grep for
>> "running_partitioned_network" - I suspect you will find some matches.
>> 
>> The next release, 3.1, will have some features around automatic healing
>> of network partitions.
>> 
>> Cheers, Simon
>> 
>> On 15/04/2013 10:16, Emile Joubert wrote:
>>> 
>>> Hi,
>>> 
>>> On 12/04/13 19:36, Matt Wise wrote:
>>>> Since creating the new server farm though we've had 3 outages. In the
>>>> first two outages we received a Network Partition Split, and
>>>> effectively all 3 of the systems decided to run their own queues
>>>> independently of the other servers. This was the first time we'd ever
>>>> seen this failure, ever. In the most recent failure we had 2 machines
>>>> split off, and the 3rd rabbitmq service effectively became
>>>> unresponsive entirely.
>>> 
>>> Versions 2.8.x and 3.0.x are equally susceptible to partitions. You can
>>> confirm this experimentally by setting up a cluster of v2.8.x nodes and
>>> interrupting connectivity for twice the net_ticktime (60s by default).
>>> 
>>> See https://www.rabbitmq.com/partitions.html
>>> 
>>>> Up until recently though I had felt extremely comfortable with
>>>> RabbitMQ's clustering technology and reliability... now ... not so
>>>> much. Has anyone else seen similar behaviors? Is it simply due to the
>>>> fact that we're running cross-zone now in Amazon, or is it more likely
>>>> the 3 servers that caused the problem? Or the 3.0.x upgrade?
>>> 
>>> A network outage coincided with the period when nodes were running
>>> v3.0.4. The network interruption is the cause of the partition rather
>>> than the broker version.
>>> 
>>> At the time of writing RabbitMQ clustering does not tolerate network
>>> partitions well, so it should not be used over a WAN. The shovel or
>>> federation plugins are better solutions for that case.
>>> 
>>> See http://www.rabbitmq.com/clustering.html
>>> 
>>> 
>>> 
>>> -Emile
>>> 
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq-discuss at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss