[rabbitmq-discuss] Amazon EC2 spurious cluster timeouts

Karl Rieb karl.rieb at gmail.com
Sat May 18 14:01:48 BST 2013


Maslinski, Ray <MaslinskiR at ...> writes:

> Hello,
>  
> I’ve been working with several two node clusters running various versions
> of 3.0.x, hosted on m1.small instances on Ubuntu 12.04.2 LTS in EC2.  The
> setup is essentially as described here
> http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/
> with the main exception being that both of the RabbitMQ servers are in the
> same availability zone.  A while back I observed a half dozen or so
> occurrences over the course of a week where the clusters would become
> partitioned, accompanied by messages on each server such as:
>  
> =ERROR REPORT==== 17-May-2013::01:56:45 ===
> ** Node 'rabbit@oemsg-new-29b15241' not responding **
> ** Removing (timedout) connection **
>  
> =INFO REPORT==== 17-May-2013::01:56:45 ===
> rabbit on node 'rabbit@oemsg-new-29b15241' down
> 
> Looking over the logs and EC2 metrics, I wasn’t able to identify any other
> anomalies that coincided with these failures.  In particular, the load
> balancers in front of the cluster nodes did not report any health check
> failures connecting to the amqp port (on a 30 second interval), suggesting
> that network connectivity was otherwise healthy, and there didn’t appear
> to be any unexpected spikes in resource consumption (such as excessive
> cpu/disk/network activity).  The rabbit servers were fairly lightly loaded
> with messaging traffic at the time, and running some load tests against
> the same servers afterwards didn’t induce any further failures over the
> course of several days.  I tried increasing the net_ticktime to something
> like 5 or 10 minutes, but still observed a failure with the new value.
>  
> I left several clusters running over an extended period, most with little
> or no load, and one cluster running under an extended load test.  Several
> of the clusters experienced no failures over the course of a couple of
> months, while others became partitioned after a while (though they seemed
> to survive for at least a few weeks before partitioning).
>  
> Has anyone experienced anything similar in EC2, or does anyone have ideas
> about what else might be done to diagnose what’s going on?
>  
> Ray Maslinski
> Senior Software Developer, Engineering
> Valassis / Digital Media
> Cell: 585.330.2426
> maslinskir at valassis.com
> www.valassis.com
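
For reference, the net_ticktime Ray tried raising is a parameter of the
Erlang kernel application rather than of RabbitMQ itself, so it is set in
the kernel section of the node's config file. A minimal sketch, assuming
the Erlang-term rabbitmq.config format used by the 3.0.x series (the value
is in seconds, 300 being roughly the 5 minutes mentioned above; it should
match on every node in the cluster, and the node needs a restart to pick
it up):

  %% /etc/rabbitmq/rabbitmq.config
  [
    {kernel, [{net_ticktime, 300}]}
  ].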

Hi Ray,

One last thing I forgot to mention: we only experience this problem in the
AWS public cloud with our own AMIs. Essentially we take a stock Ubuntu 12.04
LTS AMI, set it up using Chef so it has all the appropriate users, packages,
etc., then we stop that instance and create a new AMI from it (we "bake" a
new image).  When we deploy, we use instances launched from our newly baked
AMIs.

We do a lot of staging and demo work in the AWS VPC, and we have never 
experienced the problems there.  In the VPC we don't use our baked AMIs and 
instead completely rely on Chef to bring up and prepare all our nodes.
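
Since the baked AMIs and the Chef-built VPC nodes come from different build
paths, one cheap way to narrow things down is to compare a node from each
environment and look for drift. A sketch, assuming Ubuntu 12.04 on both
sides (run on each node, then diff the output):

  dpkg -l | grep -E 'rabbitmq|erlang'   # installed package versions
  uname -r                              # kernel version
  cat /etc/hosts                        # hostname mappings the nodes resolve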

Don't know if this helps at all to narrow down the possible issues.
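
A few checks can also help when a node drops out; this is a sketch that
assumes rabbitmqctl eval is available on the release in use
(net_kernel:get_net_ticktime/0 is a standard Erlang/OTP call):

  # The cluster's view of itself; on 3.1.0 and later the output
  # also includes a partitions section:
  rabbitmqctl cluster_status

  # The tick time actually in force on this node, in seconds; this
  # confirms whether a net_ticktime change really took effect:
  rabbitmqctl eval 'net_kernel:get_net_ticktime().'

  # Node names and distribution ports registered with the local epmd:
  epmd -names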

-Karl



