[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Tim Watson tim at rabbitmq.com
Mon Apr 22 12:23:53 BST 2013


Hi Matt,

On 04/16/2013 10:58 PM, Matt Wise wrote:
> Interesting... We are still running one node in us-west-1a, and one in us-west-1c. Today alone we saw 3 network glitches on the node in us-west-1a where it became unable to connect to remote services in other datacenters. Obviously the hardware or rack that machine is living on is having problems.
>
> The interesting part though is that RabbitMQ did not seem to report a network partition during these events now that we're running 2.8.5 and only two nodes instead of three. I'm still digging through the logs to see if there were other interruptions, but it still feels like the code is either:
>    a) more stable with 2 nodes
>    b) more fault-tolerant in 2.8.5 than it is in 3.0.4

If RabbitMQ doesn't report a partition, that can be because the network 
interruption was shorter than the net tick time, so neither node ever 
considered the other down - which Michael L mentioned in his follow-up 
email. Michael's advice seems sound to me, although I have only a 
little experience with EC2 myself.
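
For reference, the tick time is the Erlang kernel's net_ticktime 
parameter and can be raised in rabbitmq.config. A minimal sketch (120 
is just an illustrative value; every node in the cluster must use the 
same setting):

    %% rabbitmq.config: raise the net tick time from the default of
    %% 60 seconds so that short connectivity glitches are less likely
    %% to be treated as node failures (120 is an illustrative value)
    [
      {kernel, [{net_ticktime, 120}]}
    ].

Bear in mind the trade-off: a larger value also means genuine node 
failures take proportionally longer to detect.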

In terms of stability, I should point out that if a 2.8.5 cluster is not 
detecting a partition but a 3.x cluster is, then the 3.x cluster is 
doing the right thing - whether deliberately or not (see Simon's email). 
It's also *very* likely that any difference in behaviour you're seeing 
is down to altered operating (i.e., environmental) conditions, given 
that the partition handling code has not changed between the two 
versions.

On the whole, an 'undetected' partition is really bad news, because the 
mnesia (distributed database) system of record needs to be kept 
synchronised, and detecting (and responding to) partitions is a 
necessary part of that. One bug we have seen before, for example, 
involves nodes that go offline and recover faster than the net 
ticktime, leaving stale information in mnesia. That staleness can 
result in message loss: a surviving node forwards messages to a queue 
process id that is no longer valid, because it belonged to a process 
that was running before the remote node went offline. I'm not saying 
that bug is relevant to your case; I'm simply pointing out that 
detecting partitions is better (from a consistency standpoint) than not.
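
For what it's worth, a quick way to see what each node currently 
believes about the cluster, and to look for the event mnesia logs when 
a partition *is* detected (the log path below assumes the stock 
Debian/RPM layout and may differ on your machines):

    # On each node: show this node's view of the cluster membership;
    # comparing the output across nodes will reveal disagreements
    rabbitmqctl cluster_status

    # Search the broker logs for mnesia's partition event - this is
    # what gets logged when a partition is detected on reconnection
    grep "running_partitioned_network" /var/log/rabbitmq/rabbit@*.log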

In terms of the issue you're seeing: as Emile points out, Rabbit 
clustering is not at all tolerant of network partitions. As Simon 
mentioned previously, in 3.x we're more active about displaying 
partitions, and in the upcoming 3.1 release we are providing features 
that will allow clustered brokers to actively attempt to resolve 
partitions if/when possible. It's quite possible that with those 3.1 
features you will be able to resolve your situation, perhaps together 
with the changes to your EC2 infrastructure/setup that Michael 
recommended.
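
As a sketch of what that will look like - assuming the option keeps the 
cluster_partition_handling name and the pause_minority/autoheal modes 
currently described for 3.1 - it's a new rabbit application setting in 
rabbitmq.config:

    %% rabbitmq.config: opt in to automatic partition handling (3.1+).
    %% pause_minority - nodes on the minority side of a partition pause
    %%                  themselves until connectivity is restored
    %% autoheal       - on recovery, a winning partition is chosen and
    %%                  nodes in the losing partition(s) restart
    %% ignore         - the default; behaves like releases before 3.1
    [
      {rabbit, [{cluster_partition_handling, pause_minority}]}
    ].

Note that pause_minority is of limited use for a two-node cluster like 
yours, since neither half of a partition constitutes a majority.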

Cheers,
Tim


