[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Mon Apr 15 10:16:23 BST 2013

Hi,

On 12/04/13 19:36, Matt Wise wrote:
> Since creating the new server farm though we've had 3 outages. In the
> first two outages we received a Network Partition Split, and effectively
> all 3 of the systems decided to run their own queues independently of
> the other servers. This was the first time we'd ever seen this failure,
> ever. In the most recent failure we had 2 machines split off, and the
> 3rd rabbitmq service effectively became unresponsive entirely.

Versions 2.8.x and 3.0.x are equally susceptible to partitions. You can
confirm this experimentally by setting up a cluster of v2.8.x nodes and
interrupting connectivity for twice the net_ticktime (net_ticktime is
60s by default, so roughly two minutes).

See https://www.rabbitmq.com/partitions.html
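
To make the timing in that experiment concrete, here is a rough Python
sketch of the arithmetic. It assumes the behaviour described in the
Erlang kernel documentation (not stated in this thread): a silent peer
is declared down somewhere between 0.75x and 1.25x net_ticktime. The
function names and labels are made up for illustration, and a node can
also be declared down earlier for other reasons (for example a TCP
reset), which this sketch ignores.

```python
# Sketch only: estimates when Erlang distribution would declare a silent
# peer down, based on net_ticktime. Names here are illustrative and are
# not part of RabbitMQ or Erlang.

def detection_window(net_ticktime=60.0):
    """Return (earliest, latest) seconds at which a silent peer is declared down.

    Assumption from the Erlang kernel docs: detection happens within
    net_ticktime minus a quarter and net_ticktime plus a quarter.
    """
    quarter = net_ticktime / 4.0
    return net_ticktime - quarter, net_ticktime + quarter


def outage_effect(outage_seconds, net_ticktime=60.0):
    """Classify an outage relative to the tick-timeout detection window."""
    earliest, latest = detection_window(net_ticktime)
    if outage_seconds < earliest:
        return "below window"   # tick timeout alone should not fire
    if outage_seconds <= latest:
        return "within window"  # tick timeout may or may not fire
    return "beyond window"      # tick timeout is expected to have fired


if __name__ == "__main__":
    print(detection_window())
    for seconds in (30, 60, 120):
        print(seconds, outage_effect(seconds))
```

With the default of 60s, an interruption of twice the net_ticktime
(120s) falls well beyond the window, which is why it is a safe duration
for reproducing a partition.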

> Up until recently though I had felt extremely comfortable with
> RabbitMQ's clustering technology and reliability... now ... not so much.
> Has anyone else seen similar behaviors? Is it simply due to the fact
> that we're running cross-zone now in Amazon, or is it more likely the 3
> servers that caused the problem? Or the 3.0.x upgrade?

The network outage happened to coincide with the period when the nodes
were running v3.0.4. The network interruption, not the broker version,
is the cause of the partition.

At the time of writing, RabbitMQ clustering does not tolerate network
partitions well, so it should not be used over a WAN. The shovel or
federation plugins are better suited to that case.

See http://www.rabbitmq.com/clustering.html

-Emile