[rabbitmq-discuss] How robust is clustering, and under what conditions?

Eugene Kirpichov ekirpichov at gmail.com
Thu Nov 15 13:28:51 GMT 2012


Hi Simon,

Thank you, it all makes sense now.

So, we can say "either reboot one node at a time, or - if you're rebooting
all of them - make sure they start in the reverse of the order they were
stopped, or all within a window of 30 seconds at most".

Can we also say "if something bad happened, kill -9 all rabbits, then start
them all within a 30-second window"? [I'm talking kill -9 because in some
cases, with a messed-up startup order, rabbitmqctl stop also hangs]
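
For concreteness, this is the kind of procedure I have in mind - just a
sketch, with made-up host names (rabbit1..rabbit3) and the standard
rabbitmq-server / rabbitmqctl commands:

    # Option 1: reboot one node at a time, and check it is back before
    # touching the next one.
    ssh rabbit1 reboot
    ssh rabbit1 rabbitmqctl status   # repeat until this reports the app running

    # Option 2: full-cluster restart. Stop the nodes in one order and start
    # them in the reverse order (or start them all within the ~30s window
    # Simon describes below).
    for n in rabbit1 rabbit2 rabbit3; do ssh "$n" rabbitmqctl stop; done
    for n in rabbit3 rabbit2 rabbit1; do ssh "$n" rabbitmq-server -detached; done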


On Thu, Nov 15, 2012 at 4:33 PM, Simon MacMullen <simon at rabbitmq.com> wrote:

> On 15/11/12 12:04, Eugene Kirpichov wrote:
>
>> Is RabbitMQ HA and clustering sufficiently reliable to use it in
>> scenarios where the network is good, but nodes can reboot at any time?
>>
>
> We believe so.
>
>
>  My understanding was that this is what "HA" is supposed to mean, but
>> then I read this:
>>
>> http://stackoverflow.com/questions/8654053/rabbitmq-cluster-is-not-reconnecting-after-network-failure
>>
>
> This one was a network partition - clusters don't handle partitions well.
>
>  http://rabbitmq.1065348.n5.nabble.com/Cluster-nodes-stop-start-order-can-lead-to-failures-td21965.html
>>
>
> This one is the stop-start ordering problem (discussed below).
>
>  http://rabbitmq.1065348.n5.nabble.com/Cluster-busting-shut-off-all-nodes-at-the-same-time-td22971.html :
>>
>
> As was this.
>
>  http://rabbitmq.1065348.n5.nabble.com/Repairing-a-a-crashed-cluster-td22466.html
>>
>
> This one was unclear ("something happened"), but I took the question to be
> about removing a node from a cluster when that node cannot come up. This is
> handled badly in 2.x, but 3.0 will have a rabbitmqctl subcommand to do that.
>
>  http://grokbase.com/t/rabbitmq/rabbitmq-discuss/125nxzf5nh/highly-available-cluster
>>
>
> This is another stop-start ordering problem.
>
>
>  And now I'm not so sure. It seems that there are a lot of scenarios
>> where merely rebooting the nodes in some order brings the cluster into a
>> state from which there is no automatic way out.
>>
>
> So the most common problem you cited above looks like this (let's suppose
> we have a two node cluster AB for simplicity):
>
> 1) Stop B
> 2) Stop A
> 3) Start B
> 4) Start A
>
> 3) will fail. More precisely, it will wait for 30 seconds to see if 4)
> happens, and if not then it will fail.
>
> Why? Well, a lot could have happened between 1) and 2). You could have
> declared or deleted all sorts of queues, changed everybody's password, all
> sorts of things. B has no way to know; it was down.
>
> It *can't* (responsibly) start up by itself. So it has to wait around for
> A to become available.
>
> To be more general, the last node to be stopped has to be the first one to
> be started. No other node knows what's happened in the mean time!
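
Spelling that out as a sketch for the two-node example (the node names
rabbit@A / rabbit@B are illustrative; the commands are the standard ones):

    # The problematic ordering from the example above:
    #   on B: rabbitmqctl stop        # 1) stop B
    #   on A: rabbitmqctl stop        # 2) stop A -- A now holds the latest state
    #   on B: rabbitmq-server         # 3) B waits ~30 seconds for A, then fails
    #   on A: rabbitmq-server         # 4) only rescues 3) if it happens within that window
    #
    # The safe ordering: start A (the last node stopped) first, then B.
    #   on A: rabbitmq-server -detached
    #   on B: rabbitmq-server -detached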
>
>
>  Questions:
>> 1) Is there a set of assumptions or procedures under which I can be
>> *certain* that my RabbitMQ cluster will actually tolerate unexpected
>> node failures? Maybe something like "no more than 1 node down at the
>> same time", or "at least X seconds between reboots", or "after a node
>> reboots, restart all rabbit instances" or "have at most 2 nodes" etc.?
>> I'm asking because I need to at least document this to my customers.
>>
>
> * Avoid network partitions. You can recover (see
> http://next.rabbitmq.com/partitions.html) but it's a good way to introduce
> problems.
>
> * If you stop all nodes, the first (disc) node to start should be the last
> one to stop.
>
> * If you have RAM nodes, start them after you've started some disc nodes.
>
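
A quick way to sanity-check these rules before and after a restart - a
sketch assuming cluster_status, the 3.0 subcommand name (on 2.x,
rabbitmqctl status includes similar node information, if I remember right):

    rabbitmqctl cluster_status
    # Output looks roughly like this (exact formatting varies by version):
    #   [{nodes,[{disc,[rabbit@A,rabbit@B]},{ram,[rabbit@C]}]},
    #    {running_nodes,[rabbit@A,rabbit@B,rabbit@C]}]
    #
    # i.e. make sure at least one disc node (rabbit@A or rabbit@B here) is
    # running before starting the RAM node rabbit@C.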
>
>  2) To what degree are the issues described in those threads fixed in the
>> next release of RabbitMQ - 3.0.0, and how soon is it expected to be
>> production-ready?
>>
>
> 3.0.0 will not remove this stop-start ordering constraint. I don't see how
> anything can.
>
> However, it will have some enhancements to make clustering problems easier
> to detect and fix (such as removing a dead node without its cooperation,
> making sure you don't get into a state where nodes disagree on whether they
> are clustered with each other) and it will also detect and warn more
> clearly about network partitions.
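
Assuming the subcommand mentioned above for removing a dead node ends up
being called something like forget_cluster_node (the name is my guess; the
thread doesn't say), the 3.0 procedure would be roughly:

    # Run on any surviving cluster node; rabbit@B is the node that cannot come back.
    rabbitmqctl forget_cluster_node rabbit@B
    rabbitmqctl cluster_status          # rabbit@B should no longer be listed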
>
> It should be available any day now.
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, VMware
>



-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
We're hiring! http://tinyurl.com/mirantis-openstack-engineer