[rabbitmq-discuss] 3.0.4 extremely unstable in production...?

Mon Apr 22 16:12:24 BST 2013

Thanks guys for all of the feedback and details. We've been doing some testing with clusters of 2 and 3 RabbitMQ nodes, and we've noticed that it's relatively easy to trigger a partition with 3 nodes, but with 2 nodes the cluster seems much more resilient.

In the short term, I think we're going to use 2 nodes in our RabbitMQ farm. We're working on a patch for Celery that leverages our nd_service_registry module (https://github.com/nextdoor/ndserviceregistry) to dynamically discover the RabbitMQ nodes. With this code, we should be able to have all of our clients connect to 'the first' host in the RabbitMQ farm that's available... so in the event of a network partition, our servers will all still be connected to the same place. This leaves the second node as a data mirror and a failover server in case the first one fails.
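
As a rough sketch of the 'first available host' idea (this is not the actual Celery patch, and it does not use the nd_service_registry API; the host names and the first_available helper are made up for illustration), every client sorts the discovered node list the same way and takes the first node that accepts a TCP connection:

```python
import socket


def _tcp_probe(host, port, timeout):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_available(hosts, port=5672, timeout=1.0, probe=_tcp_probe):
    """Pick the first reachable host from a deterministically ordered list.

    Sorting means every client agrees on which host is 'first', so they
    all land on the same node as long as that node is reachable.
    Returns None if no host answers.
    """
    for host in sorted(hosts):
        if probe(host, port, timeout):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical host names; in practice the list would come from
    # service discovery rather than being hard-coded.
    nodes = ["rabbit2.example.internal", "rabbit1.example.internal"]
    print(first_available(nodes))
```

The probe is passed in as a parameter so the selection logic can be exercised without a live broker. Note that this only covers picking a node at connect time; it does not handle reconnecting when the chosen node later goes away.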

Once 3.1 is out and we can experiment more with some of the automatic healing code you guys are working on, we'll re-evaluate the layout and see where we go from there. Federation is really interesting, but I think for now we'd rather keep the environment as simple as possible.

--Matt

On Apr 22, 2013, at 4:23 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hi Matt,
> 
> On 04/16/2013 10:58 PM, Matt Wise wrote:
>> Interesting... We are still running one node in us-west-1a, and one in us-west-1c. Today alone we saw 3 network glitches on the node in us-west-1a where it became unable to connect to remote services in other datacenters. Obviously the hardware or rack that machine is living on is having problems.
>> 
>> The interesting part though is that RabbitMQ did not seem to report a network partition during these events now that we're running 2.8.5 and only two nodes instead of three. I'm still digging through the logs to see if there were other interruptions, but it still feels like the code is either:
>>   a) more stable with 2 nodes
>>   b) more fault-tolerant in 2.8.5 than it is in 3.0.4
> 
> If RabbitMQ doesn't report a partition, that can be due to the net tick time being exceeded - which Michael L mentioned in his follow-up email; Michael's advice seems sound to me, although I've only a little experience with EC2 myself.
> 
> In terms of stability, I should point out that if a 2.8.5 cluster is not detecting a partition but a 3.x cluster is, then the 3.x cluster is doing the right thing - whether deliberately or not (viz Simon's email). It's also *very* likely that if you're seeing a difference in behaviour, then that is due to altered operating (i.e., environmental) parameters, given that the partition handling code has not changed between the two aforementioned versions.
> 
> On the whole, an 'undetected' partition is really bad news, because the mnesia (distributed database) system of record needs to be kept synchronised and detecting (and responding to) partitions is a very necessary part of this. One bug that we have seen before, for example, pertains to nodes that go offline and recover faster than the net ticktime and therefore leave stale information in mnesia which can result in message loss (when a surviving node forwards messages to a queue process id that is no longer valid, because it pertained to a process that was running before the remote node went offline). I'm not saying that bug is relevant to your case, but simply pointing out that detecting partitions is better (from a consistency standpoint) than not.
> 
> In terms of the issue you're seeing: as Emile points out, Rabbit clustering is not at all tolerant of network partitions. As Simon mentioned previously, in 3.x we're more active about displaying partitions, and in the upcoming 3.1 release we are providing features that will allow clustered brokers to actively attempt to resolve partitions if/when possible. It's quite possible that with the upcoming 3.1 features, you will be able to resolve your situation, perhaps with the additional changes to your EC2 infrastructure/setup that Michael recommended.
> 
> Cheers,
> Tim
>