[rabbitmq-discuss] What can cause mnesia partitioning?

Fri Oct 12 19:06:07 BST 2012

We had a pair of computers running a rabbit cluster, and somehow their
mnesia databases diverged.  Each computer was running its own rabbit
happily, but cluster_status on each showed only itself as a "running"
node, and both had log messages to the effect of:

Mnesia('rabbit@node-1'): ** ERROR ** mnesia_event got
{inconsistent_database, starting_partitioned_network, 'rabbit@node-2'}

I restarted both rabbit instances, and each came up as an
apparently functional single-node instance (cluster_status on each
still showed the other node as a disc node, but not as a running
node).  From my reading of http://www.rabbitmq.com/clustering.html, it
doesn't seem like that should happen, unless each node was somehow
convinced that it was the most up to date disc node.  Otherwise, one
of the nodes should have waited 30 seconds for the other one, and then
crashed if it couldn't be reached, right?  What sort of circumstances
would cause both nodes to think they were the most up to date, and
that they should continue running on their own?

Along those lines, is there any way to configure a rabbit node to only
run if it can contact a strict majority of disc nodes?  I think that
would make this sort of problem less likely to happen, assuming the
problem stems from a network partition, or perhaps even from some
period of time where each machine was running while the other was not.
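
To make the "strict majority" rule concrete, here is a minimal sketch in
Python of the check being asked about.  It is not an existing rabbit or
mnesia option; the function name, its parameters, and the idea of passing
in a list of reachable peers are all hypothetical and only illustrate the
arithmetic.

```python
def has_strict_majority(disc_nodes, reachable_nodes, self_node):
    """Return True if self_node plus its reachable peers form a strict
    majority of the configured disc nodes.

    All names here are hypothetical; this only illustrates the rule.
    """
    members = set(disc_nodes)
    # Count this node and every peer it can reach, ignoring anything
    # that is not a configured disc node.
    visible = (set(reachable_nodes) | {self_node}) & members
    # Strict majority: more than half, so an even split never qualifies.
    return 2 * len(visible) > len(members)


two_nodes = ["rabbit@node-1", "rabbit@node-2"]
three_nodes = two_nodes + ["rabbit@node-3"]

# Two-node cluster, peer unreachable: 1 of 2 is not a strict majority.
alone_in_pair = has_strict_majority(two_nodes, [], "rabbit@node-1")

# Three-node cluster, one peer reachable: 2 of 3 is a strict majority.
pair_of_three = has_strict_majority(
    three_nodes, ["rabbit@node-2"], "rabbit@node-1"
)
```

Under this rule a partitioned node in a two-node cluster would always
refuse to run, since neither side of an even split can hold more than
half of the disc nodes; the rule only keeps one side available with
three or more disc nodes.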