<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Jacques,<br>
<br>
Have you posted details about this to the mailing list previously? I
didn't see anything specific from you in the last week or so.<br>
<br>
Would you be able to provide logs and/or further details about your
setup? Obviously we're keen to track down any bugs that cause
operational issues and resolve them ASAP.<br>
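<br>
In the meantime, a snapshot of each node's state at the moment the
problem occurs would be really useful. If you have the management
plugin enabled, something like the Python sketch below captures the
key numbers. It assumes the default "guest" credentials and the 3.0.x
management port of 15672 (55672 on 2.8.x), and the exact field names
can vary between releases, so treat it as a starting point:<br>
<br>
<pre># Minimal sketch: capture basic node health from the RabbitMQ
# management plugin's HTTP API. Host, port and credentials are
# placeholders for your environment.
import base64
import json
import urllib.request

HOST = "localhost"   # placeholder broker host
PORT = 15672         # 15672 on 3.0.x, 55672 on 2.8.x

def api_get(path):
    req = urllib.request.Request("http://%s:%d/api/%s" % (HOST, PORT, path))
    token = base64.b64encode(b"guest:guest").decode("ascii")
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

for node in api_get("nodes"):
    print(node["name"],
          "running=%s" % node.get("running"),
          "mem_used=%s" % node.get("mem_used"),
          "mem_limit=%s" % node.get("mem_limit"),
          "mem_alarm=%s" % node.get("mem_alarm"))</pre>
<br>
The broker logs (rabbit@&lt;host&gt;.log and the matching -sasl.log)
covering the time of the outage would help too.<br>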
<br>
Cheers,<br>
Tim<br>
<br>
On 04/19/2013 04:06 PM, Jacques Doubell wrote:
<blockquote
cite="mid:398f5f6d-22c0-49ec-989b-6cfae1416cda@googlegroups.com"
type="cite">We have also recently upgraded to 3.0.4 and have since
then had 2 outages. In the one case the service was running but
non functional. The logs didn't have errors, but at a certain
point just stopped receiving new connections. We had to restart
the service and all was well until about a week later when there
were a lot of heaped up messages server side but clients could not
connect to the queue anymore. (server actively refused connection
message from the client side). We will be downgrading to 2.8.x in
the mean time.<br>
<br>
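A probe along the following lines can help tell a TCP-level
"actively refused" error apart from a broker that still accepts TCP
but never completes the AMQP handshake. It's a minimal sketch using
the Python pika client, with placeholder hostname and credentials:<br>
<br>
<pre># Minimal sketch: separate "connection refused" (no listener) from
# an AMQP handshake that fails or hangs (listener up, broker wedged).
import socket
import pika

HOST = "rabbit.example.com"  # placeholder broker host

# 1. Raw TCP probe: an actively refused connection fails right here.
try:
    socket.create_connection((HOST, 5672), timeout=5).close()
    print("TCP connect OK -- something is listening on 5672")
except OSError as exc:
    print("TCP connect failed:", exc)

# 2. Full AMQP handshake: a wedged broker can accept the TCP
#    connection yet never get through this step.
try:
    params = pika.ConnectionParameters(
        host=HOST, socket_timeout=5,
        credentials=pika.PlainCredentials("guest", "guest"))
    pika.BlockingConnection(params).close()
    print("AMQP handshake OK")
except pika.exceptions.AMQPConnectionError as exc:
    print("AMQP handshake failed:", exc)</pre>
<br>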
On Friday, April 12, 2013 8:36:22 PM UTC+2, Matt Wise wrote:
<blockquote class="gmail_quote" style="margin: 0;margin-left:
0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">We've been
running RabbitMQ 2.8.x in production in Amazon for about 16
months now with very few issues. Last week we ran into an
issue where our 2.8.5 cluster nodes hit their high memory
watermark and stopped processing jobs, effectively taking down
our entire Celery task queue. We decided to upgrade to 3.0.4
(which had been running in staging for a few weeks, as a single
instance, without issue) and at the same time beef up the size
and redundancy of our farm to three m1.large machines.
<div><br>
</div>
<div>Old Farm:</div>
<div> server1: m1.small, 2.8.5, us-west-1c</div>
<div> server2: m1.small, 2.8.5, us-west-1c</div>
<div><br>
</div>
<div>New Farm:</div>
<div> server1: m1.large, 3.0.4, us-west-1a</div>
<div> server2: m1.large, 3.0.4, us-west-1c</div>
<div> server3: m1.large, 3.0.4, us-west-1c</div>
<div><br>
</div>
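<div>(For context on that memory limit: RabbitMQ raises its memory
alarm once a node crosses vm_memory_high_watermark, 0.4 of installed
RAM by default, and then blocks publishers until usage drops. A
back-of-envelope sketch, using Amazon's published RAM figures for
these instance types, shows how much extra headroom the m1.larges
buy:)</div>
<div><br>
</div>
<pre># Back-of-envelope: where RabbitMQ's default memory alarm fires.
# RAM figures are Amazon's published specs for these instance types.
WATERMARK = 0.4  # RabbitMQ's default vm_memory_high_watermark

for instance, ram_gb in [("m1.small", 1.7), ("m1.large", 7.5)]:
    print("%-9s alarm fires at ~%.2f GB" % (instance, WATERMARK * ram_gb))

# m1.small  alarm fires at ~0.68 GB
# m1.large  alarm fires at ~3.00 GB</pre>
<div><br>
</div>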
<div>Since creating the new farm, though, we've had three
outages. In the first two we got a network partition split,
and effectively all three systems decided to run their own
queues independently of each other. That was the first time
we'd ever seen this failure. In the most recent one, two
machines split off and the third node's rabbitmq service
became entirely unresponsive.</div>
<div><br>
</div>
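<div>(When a split like this happens, each side keeps serving its
own view of the cluster, so the quickest way to see it is to query
every node directly and compare answers. Below is a minimal sketch
against the management API; it assumes the plugin is enabled, the
default "guest" credentials, and placeholder hostnames. Releases
after 3.0 also report a per-node "partitions" field once a split has
been detected:)</div>
<div><br>
</div>
<pre># Minimal sketch: ask each node which cluster members it believes
# are running. During a partition the answers disagree.
import base64
import json
import urllib.request

def cluster_view(host, port=15672):
    req = urllib.request.Request("http://%s:%d/api/nodes" % (host, port))
    token = base64.b64encode(b"guest:guest").decode("ascii")
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return [(n["name"], n.get("running")) for n in json.load(resp)]

for host in ["server1", "server2", "server3"]:  # placeholder hostnames
    print(host, "sees:", cluster_view(host))</pre>
<div><br>
</div>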
<div>For sanity's sake, at this point we've backed down to the
following configuration:</div>
<div><br>
</div>
<div>New-New Farm:</div>
<div> server1: m1.large, 2.8.5, us-west-1c</div>
<div> server2: m1.large, 2.8.5, us-west-1a</div>
<div><br>
</div>
<div>Until recently I had felt extremely comfortable with
RabbitMQ's clustering technology and reliability... now, not
so much. Has anyone else seen similar behavior? Is it simply
because we're now running cross-zone in Amazon, is it the move
to three servers, or is it the 3.0.x upgrade?</div>
<div><br>
</div>
<div>--Matt</div>
</blockquote>
<br>
<br>
<pre wrap="">_______________________________________________
rabbitmq-discuss mailing list
<a class="moz-txt-link-abbreviated" href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a>
<a class="moz-txt-link-freetext" href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a>
</pre>
</blockquote>
<br>
</body>
</html>