<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>Hi Ray,</div><div><br></div><div>We run several clusters at any one time and have not had problems such as you report – yet :)</div><ul><li>3 nodes per cluster / RabbitMQ 3.0.4 or 3.1.0 / instance sizes vary.</li><li>Deployed within a VPC, with each node in a different availability zone.</li><li>Amazon Linux and vFabric Erlang/OTP R15B02.</li><li>CloudFormation for automated deployment/autoscaling/DNS (Route 53), etc.</li></ul><div>Like you, we do not use persistent messages. We persist in Cassandra and S3.</div><div><br></div><div>Things I have learned over the years about EC2 that may help:</div><ul><li>Avoid us-east-1: crowded, older infrastructure, bigger swings in capacity, meltdowns. My current favorite: us-west-2.</li><li>Watch IO Wait on your instances: it seems to reflect the current network environment in which you are operating – neighbor instance activity, snapshot activity, and your own IO. The partitions we have had correlated with high IO Wait.</li><li>If you have a problem with an instance, start a new one to replace it, then diagnose the old one.</li><li>Go multi-region. When a zone has big problems, the regional control plane usually becomes compromised, so resource changes fail. We typically run multi-headed in 3 regions to improve both availability and end-user latency.</li></ul><div>We also have a 'backup' deployment architecture that uses federation/shovels across zones, similar to our multi-region architecture. So far we haven't needed it.</div><div><br></div><div>In general, our approach is to ensure that messages are delivered at least once, and that operations are idempotent. Resolvers de-duplicate messages and report message history. 
History patterns tell us in near real time where problems (missing messages, increased latencies) are occurring.</div><div><br></div><div>Michael</div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> &lt;Maslinski&gt;, Ray &lt;<a href="mailto:MaslinskiR@valassis.com">MaslinskiR@valassis.com</a>&gt;<br><span style="font-weight:bold">Reply-To: </span> rabbitmq &lt;<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a>&gt;<br><span style="font-weight:bold">Date: </span> Friday, May 17, 2013 4:03 PM<br><span style="font-weight:bold">To: </span> rabbitmq &lt;<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a>&gt;<br><span style="font-weight:bold">Subject: </span> [rabbitmq-discuss] Amazon EC2 spurious cluster timeouts<br></div><div><br></div><div xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><meta name="Generator" content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--><div lang="EN-US" link="blue" vlink="purple"><div class="WordSection1"><p class="MsoNormal">Hello,<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">I’ve been working with several two-node clusters running various versions of RabbitMQ 3.0.x, hosted on m1.small instances on Ubuntu 12.04.2 LTS in EC2. The setup is essentially as described here
<a href="http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/">
http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/</a> with the main exception being that both of the RabbitMQ servers are in the same availability zone. A while back I observed a half dozen or so occurrences over the course
of a week where the clusters would become partitioned, accompanied by messages on each server such as:<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">=ERROR REPORT==== 17-May-2013::01:56:45 ===<o:p></o:p></p><p class="MsoNormal">** Node 'rabbit@oemsg-new-29b15241' not responding **<o:p></o:p></p><p class="MsoNormal">** Removing (timedout) connection **<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">=INFO REPORT==== 17-May-2013::01:56:45 ===<o:p></o:p></p><p class="MsoNormal">rabbit on node 'rabbit@oemsg-new-29b15241' down<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">Looking over the logs and EC2 metrics, I wasn’t able to identify any other anomalies that coincided with these failures. In particular, the load balancers in front of the cluster nodes did not report any health check failures connecting
to the AMQP port (on a 30-second interval), suggesting that network connectivity was otherwise healthy, and there didn’t appear to be any unexpected spikes in resource consumption (such as excessive CPU/disk/network activity). The RabbitMQ servers were fairly
lightly loaded with messaging traffic at the time, and running some load tests against the same servers afterwards didn’t induce any further failures over the course of several days. I tried increasing the net_ticktime to something like 5 or 10 minutes, but
still observed a failure with the new value.<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">I left several clusters running over an extended period, most with little or no load, with one cluster running under an extended load test. Several of the clusters experienced no failures over the course of a couple of months, while others
became partitioned after a while (though they seemed to survive for at least a few weeks before partition).<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">Anyone experience anything similar in EC2, or have any ideas what else might be done to diagnose what’s going on?<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal"><b><span style="font-size: 10pt; font-family: Arial, sans-serif; ">Ray Maslinski<o:p></o:p></span></b></p><p class="MsoNormal"><span style="font-size: 10pt; font-family: Arial, sans-serif; ">Senior Software Developer, Engineering</span><o:p></o:p></p><p class="MsoNormal"><span style="font-size: 10pt; font-family: Arial, sans-serif; ">Valassis / Digital Media</span><o:p></o:p></p><p class="MsoNormal"><span style="font-size: 10pt; font-family: Arial, sans-serif; ">Cell: 585.330.2426</span><o:p></o:p></p><p class="MsoNormal"><span style="font-size: 10pt; font-family: Arial, sans-serif; "><a href="mailto:maslinskir@valassis.com">maslinskir@valassis.com</a></span><o:p></o:p></p><p class="MsoNormal"><a href="http://www.valassis.com/"><span style="font-size: 10pt; color: windowtext; font-family: Arial, sans-serif; ">www.valassis.com</span></a><span style="font-size: 10pt; font-family: Arial, sans-serif; "><o:p></o:p></span></p><p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p><p class="MsoNormal"><span style="font-size: 10pt; color: black; font-family: Arial, sans-serif; ">Creating the future of intelligent media delivery to drive your greatest success<br></span><span style="font-size: 10.5pt; color: black; font-family: Arial, sans-serif; "><br></span><b><span style="font-size: 10pt; color: black; font-family: Arial, sans-serif; ">_____________________________________________________________________________</span></b><span style="font-size: 10pt; color: black; font-family: Arial, sans-serif; "><br><br>
</span></p><p class="MsoNormal"><o:p> </o:p></p></div></div></div></span></body></html>