[rabbitmq-discuss] Amazon EC2 spurious cluster timeouts

Fri May 17 21:03:36 BST 2013

Hello,

I've been working with several two-node clusters running various versions of 3.0.x, hosted on m1.small instances on Ubuntu 12.04.2 LTS in EC2.  The setup is essentially as described here: http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/ with the main exception being that both of the RabbitMQ servers are in the same availability zone.  A while back I observed a half dozen or so occurrences over the course of a week where the clusters would become partitioned, accompanied by messages on each server such as:

=ERROR REPORT==== 17-May-2013::01:56:45 ===
** Node 'rabbit@oemsg-new-29b15241' not responding **
** Removing (timedout) connection **

=INFO REPORT==== 17-May-2013::01:56:45 ===
rabbit on node 'rabbit@oemsg-new-29b15241' down
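
For correlating these reports across both nodes, a minimal Python sketch along the following lines could pull the timestamp and node name out of each "not responding" entry. It assumes the log layout shown above (a "=LEVEL REPORT==== date ===" header followed by the message lines); find_timeouts and the two regular expressions are illustrative names, not part of RabbitMQ.

```python
import re
from datetime import datetime

# Header line of a report, e.g. "=ERROR REPORT==== 17-May-2013::01:56:45 ==="
HEADER = re.compile(
    r"^=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ===$"
)
# Body line naming the unresponsive node, e.g. "** Node 'name' not responding **"
NOT_RESPONDING = re.compile(r"^\*\* Node '?(.+?)'? not responding \*\*$")


def find_timeouts(log_text):
    """Return (timestamp, node) pairs for every 'not responding' report.

    Assumption: month abbreviations are English, as in the sample above.
    """
    events = []
    current_ts = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # Remember the timestamp of the report we are currently inside.
            current_ts = datetime.strptime(header.group(2), "%d-%b-%Y::%H:%M:%S")
            continue
        hit = NOT_RESPONDING.match(line)
        if hit and current_ts is not None:
            events.append((current_ts, hit.group(1)))
    return events
```

Running this over the log from each node and comparing the two lists would show whether both sides declared the timeout at the same moment or whether one side noticed first.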

Looking over the logs and EC2 metrics, I wasn't able to identify any other anomalies that coincided with these failures.  In particular, the load balancers in front of the cluster nodes did not report any health check failures connecting to the AMQP port (on a 30 second interval), suggesting that network connectivity was otherwise healthy, and there didn't appear to be any unexpected spikes in resource consumption (such as excessive CPU/disk/network activity).  The RabbitMQ servers were fairly lightly loaded with messaging traffic at the time, and running some load tests against the same servers afterwards didn't induce any further failures over the course of several days.  I tried increasing net_ticktime to something like 5 or 10 minutes, but still observed a failure with the new value.
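
As a rough guide to what raising net_ticktime changes: the Erlang kernel documentation describes a node as being considered down when it has not responded within net_ticktime plus or minus 25%, with ticks sent every net_ticktime/4 seconds. The sketch below computes that window for the default of 60 seconds and for the 5 and 10 minute values mentioned above. tick_window is a hypothetical helper written for illustration, not an Erlang or RabbitMQ API.

```python
def tick_window(net_ticktime):
    """Return (earliest, latest) seconds of silence before a peer is declared down.

    Based on the documented net_ticktime +/- 25% detection window.
    """
    quarter = net_ticktime / 4
    return net_ticktime - quarter, net_ticktime + quarter


if __name__ == "__main__":
    # 60 is the Erlang default; 300 and 600 are the raised values from the post.
    for seconds in (60, 300, 600):
        low, high = tick_window(seconds)
        print(f"net_ticktime={seconds}s: declared down after {low:.0f}s to {high:.0f}s of silence")
```

With a 300 second setting the window is 225 to 375 seconds, so a timeout at that value would imply the peer was unheard for several minutes rather than a brief blip, if the setting was in effect on both nodes.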

I left several clusters running over an extended period, most with little or no load, with one cluster running under an extended load test.  Several of the clusters experienced no failures over the course of a couple of months, while others became partitioned after a while (though they seemed to survive for at least a few weeks before partitioning).

Has anyone experienced anything similar in EC2, or have any ideas about what else might be done to diagnose what's going on?

Ray Maslinski
Senior Software Developer, Engineering
Valassis / Digital Media
Cell: 585.330.2426
maslinskir at valassis.com
www.valassis.com
