[rabbitmq-discuss] How robust is clustering, and under what conditions?

Thu Nov 15 12:33:11 GMT 2012

On 15/11/12 12:04, Eugene Kirpichov wrote:
> Is RabbitMQ HA and clustering sufficiently reliable to use it in
> scenarios where the network is good, but nodes can reboot at any time?

We believe so.

> My understanding was that this is what "HA" is supposed to mean, but
> then I read this:
>
> http://stackoverflow.com/questions/8654053/rabbitmq-cluster-is-not-reconnecting-after-network-failure

This one was a network partition - clusters don't handle partitions well.

> http://rabbitmq.1065348.n5.nabble.com/Cluster-nodes-stop-start-order-can-lead-to-failures-td21965.html

This one is the stop-start ordering problem (discussed below).

> http://rabbitmq.1065348.n5.nabble.com/Cluster-busting-shut-off-all-nodes-at-the-same-time-td22971.html:

As was this.

> http://rabbitmq.1065348.n5.nabble.com/Repairing-a-a-crashed-cluster-td22466.html

This one was unclear ("something happened"), but I took the question to 
be about removing a node from a cluster when that node cannot come up. 
This is handled badly in 2.x, but 3.0 will have a rabbitmqctl subcommand 
to do that.

> http://grokbase.com/t/rabbitmq/rabbitmq-discuss/125nxzf5nh/highly-available-cluster

This is another stop-start ordering problem.

> And now I'm not so sure. It seems that there are a lot of scenarios
> where merely rebooting the nodes in some order brings the cluster into a
> state from which there is no automatic way out.

So the most common problem you cited above looks like this (let's 
suppose we have a two node cluster AB for simplicity):

1) Stop B
2) Stop A
3) Start B
4) Start A

3) will fail. More precisely, it will wait for 30 seconds to see if 4) 
happens, and if not then it will fail.

Why? Well, a lot could have happened between 1) and 2). You could have 
declared or deleted all sorts of queues, changed everybody's password, 
all sorts of things. B has no way to know; it was down.

It *can't* (responsibly) start up by itself. So it has to wait around 
for A to become available.

To be more general, the last node to be stopped has to be the first one 
to be started. No other node knows what's happened in the meantime!
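
As an illustration only, the reasoning above can be modelled in a few
lines of Python. This is a toy model, not RabbitMQ's implementation:
the ToyCluster class and its methods are hypothetical names, and the
30 second wait is left out. Each node records which peers were still
running when it stopped; it may start only if nobody outlived it or if
some other node is already up to supply the current state.

```python
class ToyCluster:
    """Toy model of the stop-start ordering constraint (not RabbitMQ code)."""

    def __init__(self, nodes):
        self.running = set(nodes)
        # For each stopped node: the peers that were still up when it stopped.
        self.peers_at_stop = {}

    def stop(self, node):
        self.running.discard(node)
        self.peers_at_stop[node] = set(self.running)

    def start(self, node):
        """Return True if the node can start now, False if it has to wait."""
        outlived_by = self.peers_at_stop.get(node, set())
        # Safe if nobody outlived this node, or if some node is already
        # running and can supply the current state.
        if not outlived_by or self.running:
            self.running.add(node)
            self.peers_at_stop.pop(node, None)
            return True
        return False


cluster = ToyCluster(["A", "B"])
cluster.stop("B")            # 1) A is still up and may change state
cluster.stop("A")            # 2) A is the last node to stop
print(cluster.start("B"))    # 3) False: B must wait for A
print(cluster.start("A"))    # 4) True: A saw everything
print(cluster.start("B"))    # True: A is up, so B can now join it
```

Swapping steps 3) and 4) makes every start succeed, which is the
"last stopped, first started" rule.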

> Questions:
> 1) Is there a set of assumptions or procedures under which I can be
> *certain* that my RabbitMQ cluster will actually tolerate unexpected
> node failures? Maybe something like "no more than 1 node down at the
> same time", or "at least X seconds between reboots", or "after a node
> reboots, restart all rabbit instances" or "have at most 2 nodes" etc.?
> I'm asking because I need to at least document this to my customers.

* Avoid network partitions. You can recover from one (see 
http://next.rabbitmq.com/partitions.html) but a partition is a good way 
to introduce problems.

* If you stop all nodes, the first (disc) node to start should be the 
last one to stop.

* If you have RAM nodes, start them after you've started some disc nodes.
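
The last two rules can be turned into a restart plan. A minimal sketch
in Python, assuming you have recorded the order in which the nodes were
stopped: plan_start_order is a hypothetical helper, not part of
RabbitMQ or rabbitmqctl. It starts disc nodes in reverse stop order and
puts every RAM node after all disc nodes, which is a stricter reading
of "after some disc nodes" chosen to keep the example simple.

```python
def plan_start_order(stop_order, ram_nodes=()):
    """Return a start order for a full-cluster restart.

    stop_order: node names in the order they were stopped (earliest first).
    ram_nodes:  names of the RAM nodes; all other nodes are disc nodes.
    """
    ram = set(ram_nodes)
    # Disc nodes first, most recently stopped one at the front.
    disc_first = [n for n in reversed(stop_order) if n not in ram]
    if not disc_first:
        # The rules above assume a disc node is the first to start.
        raise ValueError("no disc node in stop_order")
    # RAM nodes only after the disc nodes are up.
    ram_last = [n for n in reversed(stop_order) if n in ram]
    return disc_first + ram_last


# Hypothetical cluster: disc nodes a and b, RAM node c,
# stopped in the order c, b, a.
print(plan_start_order(["c", "b", "a"], ram_nodes={"c"}))  # ['a', 'b', 'c']
```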

> 2) To what degree are the issues described in those threads fixed in the
> next release of RabbitMQ - 3.0.0, and how soon is it expected to be
> production-ready?

3.0.0 will not remove this stop-start ordering constraint. I don't see 
how anything can.

However, it will have some enhancements to make clustering problems 
easier to detect and fix (such as removing a dead node without its 
cooperation, and making sure you don't get into a state where nodes 
disagree on whether they are clustered with each other) and it will 
also detect and warn more clearly about network partitions.

It should be available any day now.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, VMware