[rabbitmq-discuss] 3.0.4 extremely unstable in production...?
tim at rabbitmq.com
Mon Apr 22 12:23:53 BST 2013
On 04/16/2013 10:58 PM, Matt Wise wrote:
> Interesting... We are still running one node in us-west-1a, and one in us-west-1c. Today alone we saw 3 network glitches on the node in us-west-1a where it became unable to connect to remote services in other datacenters. Obviously the hardware or rack that machine is living on is having problems.
> The interesting part though is that RabbitMQ did not seem to report a network partition during these events now that we're running 2.8.5 and only two nodes instead of three. I'm still digging through the logs to see if there were other interruptions, but it still feels like the code is either:
> a) more stable with 2 nodes
> b) more fault-tolerant in 2.8.5 than it is in 3.0.4
If RabbitMQ doesn't report a partition, that can be because the net tick
time was not exceeded during the outage - which Michael L mentioned in
his follow-up email. Michael's advice seems sound to me, although I've
only a little experience with EC2 myself.
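For reference, the net tick time is a setting of the Erlang 'kernel'
application rather than of RabbitMQ itself. A sketch of raising it,
assuming the classic Erlang-terms rabbitmq.config format (120 seconds is
an illustrative value; 60 is the default):

```erlang
%% rabbitmq.config -- illustrative sketch only.
%% net_ticktime belongs to the 'kernel' application, not 'rabbit'.
%% Raising it makes node-down (and partition) detection slower, but
%% more tolerant of brief network glitches like the EC2 ones
%% described above.
[
  {kernel, [{net_ticktime, 120}]}
].
```

Note the trade-off: a larger value means genuine node failures take
longer to detect.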
In terms of stability, I should point out that if a 2.8.5 cluster is not
detecting a partition but a 3.x cluster is, then the 3.x cluster is
doing the right thing - whether deliberately or not (viz Simon's email).
It's also *very* likely that if you're seeing a difference in behaviour,
it is due to altered operating (i.e., environmental) parameters, given
that the partition handling code has not changed between the two
versions.
On the whole, an 'undetected' partition is really bad news, because the
mnesia (distributed database) system of record needs to be kept
synchronised, and detecting (and responding to) partitions is a
necessary part of this. One bug that we have seen before, for example,
pertains to nodes that go offline and recover faster than the net
ticktime, thereby leaving stale information in mnesia; this can result
in message loss when a surviving node forwards messages to a queue
process id that is no longer valid, because it belonged to a process
that was running before the remote node went offline. I'm not saying
that bug is relevant to your case, but simply pointing out that
detecting partitions is better (from a consistency standpoint) than not
detecting them.
In terms of the issue you're seeing: as Emile points out, Rabbit
clustering is not at all tolerant of network partitions and, as Simon
mentioned previously, in 3.x we're more active about reporting
partitions. In the upcoming 3.1 release we are providing features that
will allow clustered brokers to actively attempt to resolve partitions
if/when possible. It's quite possible that with the 3.1 features you
will be able to resolve your situation, perhaps combined with the
changes to your EC2 infrastructure/setup that Michael suggested.
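For what it's worth, the 3.1 feature in question is the
cluster_partition_handling setting. A sketch, again assuming the classic
rabbitmq.config format:

```erlang
%% rabbitmq.config -- illustrative sketch of the 3.1 partition
%% handling modes. 'ignore' preserves the pre-3.1 behaviour;
%% 'pause_minority' pauses nodes that find themselves in a minority
%% partition (so the majority side keeps serving); 'autoheal'
%% restarts the nodes on the losing side once the partition ends.
[
  {rabbit, [{cluster_partition_handling, pause_minority}]}
].
```

With only two nodes, note that neither side of a partition has a
majority, so pause_minority is less useful there than in a cluster of
three or more.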