[rabbitmq-discuss] Amazon EC2 spurious cluster timeouts
Karl Rieb
karl.rieb at gmail.com
Sat May 18 13:50:12 BST 2013
Maslinski, Ray <MaslinskiR at ...> writes:
>
> Hello,
>
> I’ve been working with several two-node clusters running various versions
> of 3.0.x, hosted on m1.small instances on Ubuntu 12.04.2 LTS in EC2. The
> setup is essentially as described here
>
> http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/
>
> with the main exception being that both of the RabbitMQ servers are in the
> same availability zone. A while back I observed a half dozen or so
> occurrences over the course of a week where the clusters would become
> partitioned, accompanied by messages on each server such as:
>
> =ERROR REPORT==== 17-May-2013::01:56:45 ===
> ** Node 'rabbit@oemsg-new-29b15241' not responding **
> ** Removing (timedout) connection **
>
> =INFO REPORT==== 17-May-2013::01:56:45 ===
> rabbit on node 'rabbit@oemsg-new-29b15241' down
>
> Looking over the logs and EC2 metrics, I wasn’t able to identify any other
> anomalies that coincided with these failures. In particular, the load
> balancers in front of the cluster nodes did not report any health check
> failures connecting to the amqp port (on a 30 second interval), suggesting
> that network connectivity was otherwise healthy, and there didn’t appear to
> be any unexpected spikes in resource consumption (such as excessive
> cpu/disk/network activity). The rabbit servers were fairly lightly loaded
> with messaging traffic at the time, and running some load tests against the
> same servers afterwards didn’t induce any further failures over the course
> of several days. I tried increasing the net_ticktime to something like 5 or
> 10 minutes, but still observed a failure with the new value.
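For anyone wanting to try the same tuning: net_ticktime is an Erlang kernel
setting, so it is normally raised in the node's config file and applied to
every node in the cluster with a restart. A minimal sketch, using 300 seconds
to stand in for the "5 minutes" above (the value is illustrative):

    %% /etc/rabbitmq/rabbitmq.config -- plain Erlang terms, note the
    %% trailing dot. Ticks are exchanged four times per interval; a peer
    %% is considered down after roughly a full interval of silence.
    [
      {kernel, [{net_ticktime, 300}]}
    ].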
>
> I left several clusters running over an extended period, most with little
> or no load, with one cluster running under an extended load test. Several
> of the clusters experienced no failures over the course of a couple of
> months, while others became partitioned after a while (though they seemed
> to survive for at least a few weeks before partition).
>
> Anyone experience anything similar in EC2, or have any ideas what else
> might be done to diagnose what’s going on?
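Two quick checks that can help with diagnosis: confirm the tick time the node
is actually running with, and watch cluster state around a blip. Both commands
should be available in this era's tooling, though cluster_status only reports
partitions explicitly from 3.1 on:

    # What tick time is the Erlang VM really using?
    rabbitmqctl eval 'net_kernel:get_net_ticktime().'

    # Cluster membership; on 3.1.x this also lists any partitions
    rabbitmqctl cluster_status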
>
> Ray Maslinski
> Senior Software Developer, Engineering
> Valassis / Digital Media
> Cell: 585.330.2426
> maslinskir at valassis.com
> www.valassis.com
Hi Ray,
We've used RabbitMQ 3.0.4 and 3.1.0 in a cluster on Ubuntu 12.04 LTS in EC2 on
c1.xlarge instances. We are in the us-east-1a availability zone and have
noticed that periodically (seemingly at random times) there appears to be a
network blip across most of our instances. Every instance with a connection
to our postgresql server will suddenly have its connections time out (with
socket timeout errors). This happens across all our instances, at essentially
the exact same time! Then we notice RabbitMQ doing really strange things
(notably, my recent post about RabbitMQ crashing during a subtract_acks call
because a queue just disappears).
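Something like the sketch below can at least timestamp the blips
independently of the application logs, so they can be lined up against the
rabbit logs afterwards (the peer IP is a placeholder for the other node's
private address):

    #!/bin/sh
    # Probe the other node's epmd (4369) and AMQP (5672) ports every 5s
    # and log failures with a UTC timestamp. PEER is a placeholder for
    # the other cluster node's private IP.
    PEER=10.0.0.2
    while true; do
      TS=$(date -u +%FT%TZ)
      nc -z -w 2 "$PEER" 4369 || echo "$TS epmd unreachable"
      nc -z -w 2 "$PEER" 5672 || echo "$TS amqp unreachable"
      sleep 5
    done >> /var/log/peer-blips.log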
We have been in contact with AWS support, but so far they aren't sure what is
wrong. For the most part our services can eventually recover from the network
blip (where "eventually" can mean 30 minutes or so)... but we don't really
survive the rabbit node crashing and never coming back up. So far AWS has
suggested we run packet captures and send them the capture from one of the
crashes.
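For the captures, something along these lines is probably what they're after
(the interface name and peer IP are placeholders; the Erlang distribution
port is dynamic unless pinned with inet_dist_listen_min/max, so filtering by
the peer host is simpler):

    # Full packets of all traffic to/from the other cluster node,
    # written to a pcap file that can be handed to AWS support.
    sudo tcpdump -i eth0 -s 0 -w rabbit-blip.pcap host 10.0.0.2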
Let me know if you guys figure out anything.
-Karl