[rabbitmq-discuss] AWS clustering

Wed Sep 5 10:01:47 BST 2012

On 05/09/12 09:46, Francesco Mazzoli wrote:
> At Tue, 4 Sep 2012 20:01:38 -0700 (PDT),
> Glade wrote:
>> For a supposedly "just works" kind of service, that is just not good enough. I
>> can't have my ops people rolling out of bed to take action every time there's
>> a minor network glitch.
>
> Rabbit clustering is meant to be run on local networks and is not tolerant to
> "network glitches".  If you expect those, then don't use it.

There is in fact some built-in tolerance to glitches. Specifically, TCP 
should tolerate brief network glitches unless it has been 
misconfigured.

You may also want to increase net_ticktime, a parameter of Erlang's 
kernel application; see http://www.erlang.org/doc/man/kernel_app.html. 
Also make sure you are running the most recent Erlang release.
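
As a minimal sketch, assuming the Erlang-terms rabbitmq.config file and 
a purely illustrative value of 120 seconds (the kernel default is 60), 
raising the tick time might look like this:

```erlang
%% rabbitmq.config (Erlang terms).
%% The value 120 is an illustrative assumption, not a recommendation.
[
  {kernel, [
    %% Seconds; a silent peer is declared down after roughly
    %% 0.75x to 1.25x this value.
    {net_ticktime, 120}
  ]}
].
```

Per the kernel documentation, an unresponsive node is detected somewhere 
between 0.75 and 1.25 times net_ticktime, so 120 would give a window of 
roughly 90 to 150 seconds. All nodes in the cluster should use the same 
value.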

If, however, the glitch is severe enough to exceed those tolerances 
then you'll end up with a proper network split, which, as Francesco 
notes, Erlang's Mnesia distributed database (on which rabbit's 
clustering is based) cannot cope with, thus requiring manual 
intervention to recover.

Regards,

Matthias.