[rabbitmq-discuss] How robust is clustering, and under what conditions?
Simon MacMullen
simon at rabbitmq.com
Thu Nov 15 12:33:11 GMT 2012
On 15/11/12 12:04, Eugene Kirpichov wrote:
> Is RabbitMQ HA and clustering sufficiently reliable to use it in
> scenarios where the network is good, but nodes can reboot at any time?
We believe so.
> My understanding was that this is what "HA" is supposed to mean, but
> then I read this:
>
> http://stackoverflow.com/questions/8654053/rabbitmq-cluster-is-not-reconnecting-after-network-failure
This one was a network partition - clusters don't handle partitions well.
> http://rabbitmq.1065348.n5.nabble.com/Cluster-nodes-stop-start-order-can-lead-to-failures-td21965.html
This one is the stop-start ordering problem (discussed below).
> http://rabbitmq.1065348.n5.nabble.com/Cluster-busting-shut-off-all-nodes-at-the-same-time-td22971.html
As was this.
> http://rabbitmq.1065348.n5.nabble.com/Repairing-a-a-crashed-cluster-td22466.html
This one was unclear ("something happened"), but I took the question to
be about removing a node from a cluster when that node cannot come up.
This is handled badly in 2.x, but 3.0 will have a rabbitmqctl subcommand
to do that.
> http://grokbase.com/t/rabbitmq/rabbitmq-discuss/125nxzf5nh/highly-available-cluster
This is another stop-start ordering problem.
> And now I'm not so sure. It seems that there are a lot of scenarios
> where merely rebooting the nodes in some order brings the cluster into a
> state from which there is no automatic way out.
So the most common problem you cited above looks like this (let's
suppose we have a two node cluster AB for simplicity):
1) Stop B
2) Stop A
3) Start B
4) Start A
Step 3) will fail. More precisely, B will wait for 30 seconds to see if
A starts (step 4), and if it does not then B will fail to start.
Why? Well, a lot could have happened between 1) and 2). You could have
declared or deleted all sorts of queues, changed everybody's password,
all sorts of things. B has no way to know; it was down.
It *can't* (responsibly) start up by itself. So it has to wait around
for A to become available.
To be more general, the last node to be stopped has to be the first one
to be started. No other node knows what's happened in the meantime!
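As a rough illustration of that rule (a sketch only, not anything RabbitMQ ships): if you record the order in which you stopped the nodes, one safe start order is simply the reverse. The function name and the node names below are made up for the example.

```python
def safe_start_order(stop_order):
    """Return a start order in which the last node stopped is started first.

    stop_order: node names in the order they were stopped (hypothetical names).
    """
    return list(reversed(stop_order))


# Two-node cluster AB from the example above: B was stopped first, then A.
print(safe_start_order(["B", "A"]))  # ['A', 'B']
```

Strictly, only the first position matters for the rule stated above; reversing the whole list is just the simplest order that satisfies it.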
> Questions:
> 1) Is there a set of assumptions or procedures under which I can be
> *certain* that my RabbitMQ cluster will actually tolerate unexpected
> node failures? Maybe something like "no more than 1 node down at the
> same time", or "at least X seconds between reboots", or "after a node
> reboots, restart all rabbit instances" or "have at most 2 nodes" etc.?
> I'm asking because I need to at least document this to my customers.
* Avoid network partitions. You can recover (see
http://next.rabbitmq.com/partitions.html) but it's a good way to
introduce problems.
* If you stop all nodes, the first (disc) node you start should be the
last one you stopped.
* If you have RAM nodes, start them after you've started some disc nodes.
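If you want to sanity-check a full-cluster restart plan against the last two points, something like the following Python sketch would do. It is only an illustration: the function, its arguments, and the node names are hypothetical, and it assumes that "the last one to stop" means the last disc node that was stopped.

```python
def check_start_order(stop_order, start_order, ram_nodes=frozenset()):
    """Return a list of rule violations for a full-cluster restart plan.

    stop_order / start_order: node names in the order they were stopped
    and are planned to be started. ram_nodes: names of the RAM nodes.
    All names are hypothetical; nothing here talks to a real broker.
    """
    problems = []
    # Assumption: only disc nodes count when finding "the last one stopped".
    disc_stopped = [n for n in stop_order if n not in ram_nodes]
    if not disc_stopped:
        return ["no disc nodes in the cluster"]
    last_disc_stopped = disc_stopped[-1]
    if not start_order or start_order[0] != last_disc_stopped:
        problems.append(
            f"first node to start should be {last_disc_stopped}, "
            "the last disc node stopped"
        )
    # RAM nodes should only start once some disc node has been started.
    seen_disc = False
    for node in start_order:
        if node in ram_nodes:
            if not seen_disc:
                problems.append(f"RAM node {node} starts before any disc node")
        else:
            seen_disc = True
    return problems


# B stopped first, then A; starting B first breaks the ordering rule.
print(check_start_order(["B", "A"], ["B", "A"]))
# Disc nodes A and B plus RAM node R; A was stopped last and is started first.
print(check_start_order(["R", "B", "A"], ["A", "R", "B"], ram_nodes={"R"}))  # []
```

This does not cover partitions or unexpected single-node reboots; it only checks the ordering for the "everything was stopped" case.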
> 2) To what degree are the issues described in those threads fixed in the
> next release of RabbitMQ - 3.0.0, and how soon is it expected to be
> production-ready?
3.0.0 will not remove this stop-start ordering constraint. I don't see
how anything can.
However, it will have some enhancements to make clustering problems
easier to detect and fix (such as removing a dead node without its
cooperation, and making sure you don't get into a state where nodes
disagree on whether they are clustered with each other), and it will
also detect and warn more clearly about network partitions.
It should be available any day now.
Cheers, Simon
--
Simon MacMullen
RabbitMQ, VMware