[rabbitmq-discuss] Clustering issue

Valentin Bernard vbernard42 at gmail.com
Tue Apr 19 14:16:24 BST 2011


Hi,

Thank you for your explanations.

Yet, there is still something odd about it. Here is a concrete case:

We have three nodes on three different hosts (n1, n2 and n3), all
connected as a cluster on the local network.
The "rabbitmqctl status" command on any of these nodes shows the
following:
   {nodes,[{disc,[rabbit@n1,rabbit@n2,rabbit@n3]}]},
   {running_nodes,[rabbit@n1,rabbit@n2,rabbit@n3]}]

Now, we physically disconnect n1 from the local network, then
reconnect it after a little while.

n1 now shows the following:
   {nodes,[{disc,[rabbit@n1,rabbit@n2,rabbit@n3]}]},
   {running_nodes,[rabbit@n1]}]

... while n2 and n3 both show the following, as expected:
   {nodes,[{disc,[rabbit@n1,rabbit@n2,rabbit@n3]}]},
   {running_nodes,[rabbit@n2,rabbit@n3]}]
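
The disagreement between these views can also be detected mechanically.
Below is a minimal sketch in Python, assuming the text printed by
"rabbitmqctl status" has already been collected from each host into a
string (how it is collected is left out). The function names
running_nodes and partitions are made up for this illustration and are
not part of RabbitMQ; node names are written in the rabbit@host form.

```python
import re


def running_nodes(status_text):
    """Return the node names listed in the {running_nodes,[...]} term."""
    match = re.search(r"\{running_nodes,\[([^\]]*)\]\}", status_text)
    if match is None:
        return frozenset()
    names = (name.strip() for name in match.group(1).split(","))
    return frozenset(name for name in names if name)


def partitions(status_by_host):
    """Group hosts by the set of running nodes that each host reports.

    More than one group means the hosts disagree about cluster membership.
    """
    groups = {}
    for host, text in status_by_host.items():
        groups.setdefault(running_nodes(text), []).append(host)
    return groups


# Sample input copied from the status fragments shown above.
sample_status = {
    "n1": "{running_nodes,[rabbit@n1]}]",
    "n2": "{running_nodes,[rabbit@n2,rabbit@n3]}]",
    "n3": "{running_nodes,[rabbit@n2,rabbit@n3]}]",
}
```

Calling partitions(sample_status) groups n1 on its own and n2 with n3,
which matches the split described here.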

The cluster is partitioned, as you've explained. Now here is the
thing: if we just run the following commands on any one of n1, n2 or n3:
   > rabbitmqctl stop_app
   > rabbitmqctl start_app

... then all the nodes are back in the cluster, and the "status"
command shows the following on every node:
   {nodes,[{disc,[rabbit@n1,rabbit@n2,rabbit@n3]}]},
   {running_nodes,[rabbit@n1,rabbit@n2,rabbit@n3]}]

... and at the same time, we get a few inconsistent_database/
running_partitioned_network errors in the logs. If we added or deleted
queues or exchanges while the cluster was partitioned, the nodes are
indeed not synchronized: even though they consider themselves part of
the same cluster (and can communicate with each other), they are not
aware of the same entities. No node was reset during the whole
process. Is that a bug? This really isn't a problem for us, but it is
still somewhat disturbing: if the mnesia database is inconsistent, the
cluster should probably remain partitioned ;)
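
To make "not aware of the same entities" concrete, here is a small
sketch in Python. It assumes the queue names known to each node have
already been gathered into sets by some means not shown; the queue names
and the function name missing_entities are hypothetical and only
illustrate the divergence.

```python
def missing_entities(entities_by_node):
    """Map each node to the entities that other nodes know but it does not.

    Nodes that already know every entity are left out of the result.
    """
    known_anywhere = set().union(*entities_by_node.values())
    result = {}
    for node, known in entities_by_node.items():
        missing = known_anywhere - known
        if missing:
            result[node] = missing
    return result


# Hypothetical state after the partition: a queue was declared on n1 only,
# and another one on the n2/n3 side only.
queues_by_node = {
    "rabbit@n1": {"orders", "created_on_n1"},
    "rabbit@n2": {"orders", "created_on_n2"},
    "rabbit@n3": {"orders", "created_on_n2"},
}
```

An empty result from missing_entities would mean every node knows the
same entities; a non-empty one corresponds to the inconsistency
described above.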

Thanks,

Valentin.

On 19 Apr, 12:59, Matthew Sackman <matt... at rabbitmq.com> wrote:
> Hi,
>
> On Fri, Apr 08, 2011 at 01:33:15AM -0700, Valentin Bernard wrote:
> > I have a question regarding clustering. If, for test reasons, I break
> > the physical link between two nodes (without stopping them), then
> > restore the link, the cluster is split and the nodes can't communicate
> > between each other until I either kill a node process and restart it,
> > or run the stop_app/cluster/start_app commands on one node with
> > rabbitmqctl.
>
> > Is that a normal behavior?
>
> Yes.
>
> > I couldn't find any documentation or
> > discussion about this issue. Is there a way to make the nodes
> > automatically join the cluster back after a network failure?
>
> No, not without resetting one node. During the time of the network
> partition, both nodes remain working, but they can diverge - e.g. one
> node could have a client delete a queue, but the other node doesn't.
> It's then not clear what should happen when the partition goes away.
> Rabbit does not attempt to cope with partitions - it's incredibly
> difficult.
>
> Matthew
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-disc... at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss