[rabbitmq-discuss] 3.0.4 extremely unstable in production...?
Tim Watson
tim at rabbitmq.com
Mon Apr 22 12:10:47 BST 2013
Hi Jacques,
Have you posted details about this to the mailing list previously? I
didn't see anything specific from you in the last week or so.
Would you be able to provide logs and/or further information about your
setup? Obviously we're keen to track down any bugs that cause
operational issues and resolve them asap.
Cheers,
Tim
On 04/19/2013 04:06 PM, Jacques Doubell wrote:
> We have also recently upgraded to 3.0.4 and have since had two
> outages. In one case the service was running but non-functional:
> the logs showed no errors, but at a certain point the broker simply
> stopped accepting new connections. We had to restart the service and
> all was well until about a week later, when messages had piled up
> server-side but clients could no longer connect to the queue (the
> client side saw an "actively refused connection" error). We will be
> downgrading to 2.8.x in the meantime.
>
> On Friday, April 12, 2013 8:36:22 PM UTC+2, Matt Wise wrote:
>
> We've been running RabbitMQ 2.8.x in production in Amazon for
> about 16 months now without many issues. Last week we ran into
> an issue where our 2.8.5 cluster nodes hit their high-memory
> limit and stopped processing jobs, effectively taking down our
> entire Celery task queue. We decided to upgrade to 3.0.4 (which
> had been running in staging for a few weeks, as a single
> instance, without issue) and at the same time beef up the size
> and redundancy of our farm to three m1.large machines.
>
> Old Farm:
> server1: m1.small, 2.8.5, us-west-1c
> server2: m1.small, 2.8.5, us-west-1c
>
> New Farm:
> server1: m1.large, 3.0.4, us-west-1a
> server2: m1.large, 3.0.4, us-west-1c
> server3: m1.large, 3.0.4, us-west-1c
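The "high-memory limit" above is RabbitMQ's memory alarm (the `vm_memory_high_watermark` setting, 40% of system RAM by default): when a node crosses it, it blocks publishing connections until memory use drops, which from the outside can look like the broker has stopped processing jobs. A minimal rabbitmq.config sketch for illustration (0.4 is the documented default, not a tuning recommendation):

```
%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    %% fraction of installed RAM at which the memory alarm fires;
    %% publishers are blocked while the alarm is raised
    {vm_memory_high_watermark, 0.4}
  ]}
].
```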
>
> Since creating the new server farm, though, we've had three
> outages. In the first two we got a network partition split:
> effectively, all three systems decided to run their own queues
> independently of the other servers. That was the first time we'd
> ever seen this failure. In the most recent outage, two machines
> split off and the third rabbitmq service became entirely
> unresponsive.
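For what it's worth, later releases (3.1.0 onwards, if I recall correctly) report splits directly in `rabbitmqctl cluster_status` as a {partitions,...} item and add a cluster_partition_handling option (e.g. pause_minority). A sketch of spotting a partition from captured status output follows; the node names and the captured line are hypothetical, not taken from Matt's cluster:

```shell
# Hypothetical line captured from `rabbitmqctl cluster_status` on a
# partitioned cluster (3.1+ output format); a healthy cluster would
# show {partitions,[]} instead.
status='{partitions,[{rabbit@server1,[rabbit@server2,rabbit@server3]}]}'

# A non-empty partitions list means this node sees a split.
if echo "$status" | grep -q '{partitions,\[{'; then
    echo "partition detected"
else
    echo "cluster healthy"
fi
```

Run against the sample line above, this prints "partition detected"; the grep pattern simply checks that the partitions list is non-empty.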
>
> For sanity's sake, at this point we've backed down to the
> following configuration:
>
> New-New Farm:
> server1: m1.large, 2.8.5, us-west-1c
> server2: m1.large, 2.8.5, us-west-1a
>
> Until recently I had felt extremely comfortable with RabbitMQ's
> clustering technology and reliability... now... not so much. Has
> anyone else seen similar behavior? Is it simply because we're now
> running cross-zone in Amazon, or is it more likely the move to
> three servers, or the 3.0.x upgrade?
>
> --Matt
>
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss