[rabbitmq-discuss] 3.0.4 extremely unstable in production...?
Matt Wise
matt at nextdoor.com
Fri Apr 12 19:36:22 BST 2013
We've been running RabbitMQ 2.8.x in production on Amazon EC2 for about 16
months now with very few issues. Last week the nodes in our 2.8.5 cluster
hit their high-memory limit and stopped processing jobs, effectively taking
down our entire Celery task queue. We decided to upgrade to 3.0.4 (which had
been running in staging for a few weeks, as a single instance, without
issue) and, at the same time, to beef up the size and redundancy of our farm
to three m1.large machines.
Old Farm:
server1: m1.small, 2.8.5, us-west-1c
server2: m1.small, 2.8.5, us-west-1c
New Farm:
server1: m1.large, 3.0.4, us-west-1a
server2: m1.large, 3.0.4, us-west-1c
server3: m1.large, 3.0.4, us-west-1c
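As a rough illustration of why the instance size matters for the high-memory limit mentioned above, the sketch below estimates the memory use at which RabbitMQ would raise its memory alarm and block publishers on each instance type. It assumes the default vm_memory_high_watermark of 0.4 and the nominal EC2 memory sizes (1.7 GiB for m1.small, 7.5 GiB for m1.large); neither figure comes from the original post, and the function name is made up for this example.

```python
# Rough sketch of where the memory alarm would trip on each instance type.
# Assumptions (not from the original post): the default
# vm_memory_high_watermark of 0.4, and nominal EC2 memory sizes.
DEFAULT_WATERMARK = 0.4

INSTANCE_MEMORY_GIB = {
    "m1.small": 1.7,
    "m1.large": 7.5,
}


def alarm_threshold_gib(instance_type, watermark=DEFAULT_WATERMARK):
    """Return the approximate memory use (GiB) at which the alarm is raised."""
    return INSTANCE_MEMORY_GIB[instance_type] * watermark


if __name__ == "__main__":
    for name in INSTANCE_MEMORY_GIB:
        print(f"{name}: alarm at roughly {alarm_threshold_gib(name):.2f} GiB")
```

Under these assumptions the old m1.small nodes would hit the limit at well under 1 GiB of broker memory, while the m1.large nodes have roughly 3 GiB of headroom.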
Since creating the new server farm, though, we've had 3 outages. In the
first two, we got a network partition, and effectively all 3 of the systems
decided to run their own queues independently of the other servers. This
was the first time we'd ever seen this failure. In the most recent outage,
2 machines split off, and the RabbitMQ service on the 3rd effectively
became entirely unresponsive.
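The split described above can be sketched as a grouping problem: if each node reports which peers it currently sees as running, nodes with identical views form one partition, and more than one group means the cluster has split. This is a minimal sketch only. The `views` input and both function names are hypothetical, and how the per-node views are collected (for example by parsing `rabbitmqctl cluster_status` on each host) is left out.

```python
def find_partitions(views):
    """Group nodes into partitions.

    `views` maps each node name to the set of peers that node currently
    sees as running (hypothetical input, gathered separately per node).
    Nodes that agree on the same membership end up in the same group.
    """
    groups = {}
    for node, seen in views.items():
        # A node's view of the cluster is its visible peers plus itself.
        key = frozenset(seen) | {node}
        groups.setdefault(key, []).append(node)
    return sorted(sorted(members) for members in groups.values())


def is_partitioned(views):
    """True when the nodes disagree about cluster membership."""
    return len(find_partitions(views)) > 1


if __name__ == "__main__":
    # Each of the three nodes sees only itself, as in the first two outages.
    split = {"server1": set(), "server2": set(), "server3": set()}
    print(find_partitions(split), is_partitioned(split))
```

A healthy three-node cluster, where every node sees the other two, collapses into a single group, so `is_partitioned` returns False.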
For sanity's sake, at this point we've backed down to the following
configuration:
New-New Farm:
server1: m1.large, 2.8.5, us-west-1c
server2: m1.large, 2.8.5, us-west-1a
Up until recently, though, I had felt extremely comfortable with RabbitMQ's
clustering technology and reliability... now... not so much. Has anyone
else seen similar behavior? Is it simply because we're now running
cross-zone in Amazon, or is it more likely the 3 servers that caused the
problem? Or the 3.0.x upgrade?
--Matt