[rabbitmq-discuss] Clustering question

Fri Sep 26 16:23:01 BST 2008

Ben Hood wrote:
> Dmitriy,
> 
> On Fri, Sep 19, 2008 at 4:59 PM, Dmitriy Samovskiy
> wrote:
>>> Were there any mnesia-related errors in the logs?
>> Yes, I see ** ERROR ** mnesia_event got {inconsistent_database,
>> running_partitioned_network in rabbit.log. My bad for writing to the list without looking
>> in the logs...
> 
I spent some time this week trying to understand this issue better. Please note, however, 
that I am not an Erlang expert, and it's quite possible I got it all wrong.

A RabbitMQ cluster consisting of 2 nodes running on 2 hosts will enter the 
"{inconsistent_database, running_partitioned_network}" state any time network 
connectivity between the hosts is lost for a sufficiently long period of time *and* then 
restored *while* the rabbit nodes remain up and running (the connectivity loss must be long 
enough for rabbit to notice it). After this happens, mnesia tables will no longer be 
replicated, which for us effectively means the cluster is no longer a cluster: each rabbit 
node now acts as a standalone broker.

Note that I may be wrong in saying "must be long enough for rabbit to notice" - I haven't 
tested it well enough. In theory, it looks to me like the cluster should not be impacted if 
the connectivity loss is short.

I suspect, based on what I read (but have not verified), that if your cluster consists of 
N nodes where N > 2, a partitioned network between any two of them will cause the entire 
cluster to break down. I may be wrong here; the behaviour may differ, for example, when 
some nodes are disk replicas and some nodes are ram replicas.

There is no common solution to this problem other than a restart (of the entire node, or at 
least of mnesia). Some people on erlang-questions reported running an external app that 
watches for this state from outside of distributed erlang and then decides which nodes to 
restart or what else to do.

http://www.erlang.org/pipermail/erlang-questions/2008-March/033291.html
http://www.erlang.org/pipermail/erlang-questions/2004-February/011587.html
http://groups.google.com/group/erlang-questions/search?hl=en&group=erlang-questions&q=mnesia+%22partitioned+network%22
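
To make the "external app" idea a bit more concrete, here is a rough sketch (in Python, 
not Erlang) of the decision step only. Everything in it is my own assumption and not 
something taken from those threads: the function names are made up, the node names are 
examples, and the policy of keeping the largest partition and restarting the rest is 
just one possible choice. How each node's view of the cluster gets collected is left out.

```python
# Hypothetical sketch: decide which nodes to restart after a network partition.
# `views` maps each node name to the set of nodes that node currently reports
# as running. Collecting these views (outside distributed erlang) is not shown.

def find_partitions(views):
    """Group nodes by identical view; each group is one partition."""
    groups = {}
    for node, seen in views.items():
        key = frozenset(set(seen) | {node})  # a node always sees itself
        groups.setdefault(key, set()).add(node)
    # Largest partition first; ties are broken by node name for determinism.
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda g: (-len(g), g))

def nodes_to_restart(views):
    """Assumed policy: keep the largest partition, restart every other node."""
    partitions = find_partitions(views)
    if len(partitions) <= 1:
        return []  # all nodes agree, nothing to do
    return sorted(node for part in partitions[1:] for node in part)
```

For the 2-node case described above, where each node only sees itself after the 
partition, this picks one node to keep and returns the other as the one to restart. 
With equal-sized partitions the choice is arbitrary (here, alphabetical), so a real 
watcher would need a better tie-break.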

All in all, to answer my original question: there is nothing one can do by remsh'ing into 
a running rabbit node to restore the cluster except calling mnesia:stop() and 
mnesia:start() on specific nodes, and even then it might not work.

Looking forward to AMQP-level federation and HA...

- Dmitriy