[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Fri Apr 19 16:06:16 BST 2013

We also recently upgraded to 3.0.4 and have had 2 outages since then. 
In the first case the service was running but non-functional. The logs 
showed no errors, but at a certain point the broker simply stopped 
accepting new connections. We had to restart the service, and all was well 
until about a week later, when a lot of messages had piled up server side 
but clients could no longer connect to the queue (the client side reported 
a "server actively refused connection" message). We will be downgrading to 
2.8.x in the meantime.
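
For anyone who wants to catch the "running but not accepting connections" 
state before clients do, here is a minimal sketch of a TCP probe. It is not 
something we ran in production; the function name is made up for 
illustration, and it assumes the broker listens on the default AMQP port 
5672. It only checks that the listener accepts a TCP connection, not that 
an AMQP handshake succeeds.

```python
import socket


def broker_accepts_connections(host, port=5672, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: 5672 is the default AMQP port, adjust if your
    broker listens elsewhere.
    """
    try:
        # create_connection resolves the host and performs the TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling it from cron or a monitoring agent against each node, and alerting 
when it returns False, would at least flag a refused connection like the 
one in our second outage.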

On Friday, April 12, 2013 8:36:22 PM UTC+2, Matt Wise wrote:
>
> We've been running RabbitMQ 2.8.x in production in Amazon for about 16 
> months now without very many issues. Last week we ran into an issue where 
> our 2.8.5 cluster nodes hit their high-memory-limit and stopped processing 
> jobs, effectively taking down our entire Celery task queue. We decided to 
> upgrade the software to 3.0.4 (which had been running in staging for a few 
> weeks, as a single instance, without issue) and at the same time beef up 
> the size and redundancy of our farm to 3 m1.large machines.
>
> Old Farm:
>   server1: m1.small, 2.8.5, us-west-1c
>   server2: m1.small, 2.8.5, us-west-1c
>
> New Farm:
>   server1: m1.large, 3.0.4, us-west-1a
>   server2: m1.large, 3.0.4, us-west-1c
>   server3: m1.large, 3.0.4, us-west-1c
>
> Since creating the new server farm, though, we've had 3 outages. In the 
> first two outages we hit a network partition, and effectively all 3 of 
> the systems decided to run their own queues independently of the other 
> servers. This was the first time we'd ever seen this failure. In the 
> most recent outage 2 machines split off, and the rabbitmq service on 
> the 3rd became entirely unresponsive.
>
> For sanity's sake, at this point we've backed down to the following 
> configuration:
>
> New-New Farm:
>   server1: m1.large, 2.8.5, us-west-1c
>   server2: m1.large, 2.8.5, us-west-1a
>
> Up until recently, though, I had felt extremely comfortable with 
> RabbitMQ's clustering technology and reliability... now ... not so much. 
> Has anyone else seen similar behavior? Is it simply due to the fact that 
> we're running cross-zone now in Amazon, or is it more likely the move to 
> 3 servers that caused the problem? Or the 3.0.x upgrade?
>
> --Matt
>