[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Mon Apr 22 12:10:47 BST 2013

Hi Jacques,

Have you posted details about this to the mailing list previously? I 
didn't see anything specific from you in the last week or so.

Would you be able to provide logs and/or further information about your 
setup? Obviously we're keen to track down any bugs that cause 
operational issues and resolve them as soon as possible.

Cheers,
Tim

On 04/19/2013 04:06 PM, Jacques Doubell wrote:
> We have also recently upgraded to 3.0.4 and have had 2 outages since 
> then. In the first, the service was running but non-functional: the 
> logs showed no errors, but at a certain point the service simply 
> stopped accepting new connections. We had to restart the service, and 
> all was well until about a week later, when a large backlog of 
> messages had built up server side but clients could no longer connect 
> to the queue (the client side reported a "server actively refused 
> connection" message). We will be downgrading to 2.8.x in the meantime.
>
> On Friday, April 12, 2013 8:36:22 PM UTC+2, Matt Wise wrote:
>
>     We've been running RabbitMQ 2.8.x in production in Amazon for
>     about 16 months now with very few issues. Last week we ran
>     into an issue where our 2.8.5 cluster nodes hit their
>     high-memory-limit and stopped processing jobs, effectively taking
>     down our entire Celery task queue. We decided to upgrade the
>     software to 3.0.4 (which had been running in staging for a few
>     weeks, as a single instance, without issue) and at the same time
>     beef up the size and redundancy of our farm to 3 m1.large
>     machines.
>
>     Old Farm:
>       server1: m1.small, 2.8.5, us-west-1c
>       server2: m1.small, 2.8.5, us-west-1c
>
>     New Farm:
>       server1: m1.large, 3.0.4, us-west-1a
>       server2: m1.large, 3.0.4, us-west-1c
>       server3: m1.large, 3.0.4, us-west-1c
>
>     Since creating the new server farm, though, we've had 3 outages.
>     In the first two outages we received a Network Partition Split,
>     and effectively all 3 of the systems decided to run their own
>     queues independently of the other servers. This was the first
>     time we'd ever seen this failure. In the most recent failure, 2
>     machines split off and the 3rd rabbitmq service effectively
>     became entirely unresponsive.
>
>     For sanity's sake, at this point we've backed down to the following
>     configuration:
>
>     New-New Farm:
>       server1: m1.large, 2.8.5, us-west-1c
>       server2: m1.large, 2.8.5, us-west-1a
>
>     Up until recently though I had felt extremely comfortable with
>     RabbitMQ's clustering technology and reliability... now ... not so
>     much. Has anyone else seen similar behaviors? Is it simply due to
>     the fact that we're running cross-zone now in Amazon, or is it
>     more likely the 3 servers that caused the problem? Or the 3.0.x
>     upgrade?
>
>     --Matt