[rabbitmq-discuss] rabbitmq cluster manage/ops questions

Wed Aug 15 20:33:08 BST 2012

Hello,

I'm working on a project involving clustered rabbitmq brokers and I
would like to gain a better understanding of the operational
constraints. I've read the clustering article on the site, but I feel
like I don't have a solid understanding of it yet.

Specific questions:

1) What constraints need to be observed to guarantee that the cluster
state remains consistent, i.e., that the cluster will not fall into a
"split brain" state?
2) Is there anything invalid or problematic about the tests I describe below?

I've been running tests involving 2-3 clustered brokers. All brokers
are running 2.7.1. The cookies are synched properly.

The cluster can be in one of the following states:

2 disc
2 disc, 1 ram
3 disc

In any of the states, my test can kill one of the brokers (ram or
disc, it doesn't discriminate). If a broker is killed, the next event
the test executes is either a restart of the dead broker or a
replacement of the dead broker. Replacement is done by deleting the
mnesia database on that node and then running, in order: service
start, stop_app, reset, force_cluster, start_app.
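
For concreteness, the replacement sequence could be scripted roughly
as follows. This is only a sketch: the mnesia directory path, the
service name (rabbitmq-server), the node name passed in, and the
DRY_RUN switch are all assumptions for illustration, not details of my
actual test harness. With DRY_RUN=1 (the default here) the commands
are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the "replace a dead broker" sequence described above.
# Assumptions (illustrative only): MNESIA_DIR, the service name, and
# the DRY_RUN switch are hypothetical and may differ per installation.
MNESIA_DIR="${MNESIA_DIR:-/var/lib/rabbitmq/mnesia}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

replace_broker() {
    # "$@": the cluster node name(s) handed to force_cluster.
    run rm -rf "${MNESIA_DIR:?}"          # delete the mnesia database
    run service rabbitmq-server start     # bring the broker back up
    run rabbitmqctl stop_app
    run rabbitmqctl reset
    run rabbitmqctl force_cluster "$@"
    run rabbitmqctl start_app
}
```

Calling `replace_broker rabbit@node1` in dry-run mode prints the six
commands in the order they would be executed.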

In the 2-broker cluster states, the cluster can be grown to a size of 3.

In the 3-broker cluster states, the cluster can be shrunk to a size of 2,
with the constraint that it won't ever shrink to 1 disc/1 ram.

Growing and shrinking of the cluster is always done by running
rabbitmqctl commands. I.e., there's no cluster configuration in the
rabbitmq.config file. For those who will ask, the commands I'm running
to grow and shrink the cluster are:

1) grow cluster by adding a ram node
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl cluster <existing-disc-node>
rabbitmqctl start_app

2) grow cluster by adding a disc node
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl cluster <existing-disc-node> <node-to-be-added>
rabbitmqctl start_app

3) shrink cluster by removing a node
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

I'm not inserting any delays between execution of events (other than
the implicit delay of having to ssh into the server and execute the
command.)
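
If explicit delays turn out to matter, one option would be to poll the
node before firing the next event. The helper below is a minimal
sketch under the assumption that `rabbitmqctl status` exits non-zero
while the node is unreachable; the CTL override, the attempt count,
and the delay are hypothetical parameters. Note that a successful
status call only shows that the Erlang node answers, not that the
rabbit application has been started.

```shell
#!/bin/sh
# CTL is a hypothetical override so the helper can be tried without a
# real broker (for example CTL=true or CTL=false).
CTL="${CTL:-rabbitmqctl}"

wait_for_node() {
    # $1: node name
    # $2: maximum number of attempts (default 30)
    # $3: delay between attempts in seconds (default 1)
    node="$1"
    attempts="${2:-30}"
    delay="${3:-1}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Assumed: status returns 0 once the node responds.
        if $CTL -n "$node" status >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_node rabbit@node1 10 2` would try up to ten
times, two seconds apart, before giving up.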

One issue I've encountered so far:

1) rabbit fails to start after shrinking a 2 disc/1 ram cluster to a
2 disc cluster and then killing a disc node. Here's the log from the
disc node which fails to start. There's also output from my test
script at the bottom which shows the cluster status:
http://pastebin.com/95McUzkb

-Torin