[rabbitmq-discuss] Clustering question
Dmitriy Samovskiy
dmitriy.samovskiy at cohesiveft.com
Fri Sep 26 16:23:01 BST 2008
Ben Hood wrote:
> Dmitriy,
>
> On Fri, Sep 19, 2008 at 4:59 PM, Dmitriy Samovskiy
> wrote:
>>> Were there any mnesia-related errors in the logs?
>> Yes, I see ** ERROR ** mnesia_event got {inconsistent_database,
>> running_partitioned_network in rabbit.log. My bad for writing to the list without looking
>> in the logs...
>
I spent some time this week trying to understand this issue better. Please note however
that I am not an Erlang expert, and it's quite possible I got it all wrong.
A RabbitMQ cluster consisting of 2 nodes running on 2 hosts will enter the
"{inconsistent_database, running_partitioned_network}" state whenever network
connectivity between the hosts is lost for a sufficiently long period of time *and* then
restored *while* the rabbit nodes remain up and running (the connectivity loss must be long
enough for rabbit to notice it). After this happens, mnesia tables will no longer be
replicated, which for us effectively means the cluster is no longer a cluster: each rabbit
node now acts as a standalone broker.
Note I may be wrong in saying "must be long enough for rabbit to notice" - I haven't tested
it well enough. In theory, it looks to me like the cluster should not be impacted if the
connectivity loss was short.
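A quick way to check whether a node has hit this state is to look for the event in its rabbit.log. Below is a minimal Python sketch, assuming the log text contains the two atoms in the same order as the error quoted above. The function name is made up for illustration, and the pattern tolerates a line break between the atoms in case the log wraps.

```python
import re

# Pattern for the mnesia event quoted earlier in this thread. \s* also
# matches a newline, so a wrapped log entry is still detected.
PARTITION_RE = re.compile(r"inconsistent_database,\s*running_partitioned_network")


def has_partition_event(log_text: str) -> bool:
    """Return True if the log text mentions the partitioned-network event.

    Hypothetical helper: it only inspects text, it does not talk to the broker.
    """
    return PARTITION_RE.search(log_text) is not None
```

To use it, read the contents of rabbit.log (wherever it lives on your host) and pass the text to has_partition_event().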
I suspect, based on what I have read (but have not verified), that if your cluster consists
of N nodes where N > 2, a partitioned network between any two will cause the entire cluster
to break down. I may be wrong here, possibly when some nodes are disk replicas and some
nodes are ram replicas.
There is no common solution to this problem other than a restart (of the entire node, or at
least of mnesia). Some people on erlang-questions reported running an external app that
watches for this state from outside distributed Erlang and then decides which nodes to
restart or what else to do.
http://www.erlang.org/pipermail/erlang-questions/2008-March/033291.html
http://www.erlang.org/pipermail/erlang-questions/2004-February/011587.html
http://groups.google.com/group/erlang-questions/search?hl=en&group=erlang-questions&q=mnesia+%22partitioned+network%22
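For illustration only, here is a rough Python sketch of the decision step such an external watcher might take. It is not taken from any of the threads above. It assumes the watcher has already collected, by some out-of-band means, each node's own view of which peers are running (for example what mnesia:system_info(running_db_nodes) reports on that node), and that after a split each node's view leaves out the peers it has split from. The policy (keep the largest group, break ties by node name, restart everyone else) and all function names are assumptions.

```python
from typing import Dict, FrozenSet, List, Set


def partition_groups(views: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Collapse per-node views into distinct groups, largest group first.

    views maps a node name to the set of nodes it believes are running.
    Each node is added to its own view in case it was not listed there.
    """
    groups = {frozenset(seen | {node}) for node, seen in views.items()}
    # Largest group first; ties are broken by sorted node names so the
    # result is deterministic (an arbitrary policy chosen for this sketch).
    return sorted(groups, key=lambda g: (-len(g), sorted(g)))


def nodes_to_restart(views: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes outside the group we keep, or [] if views agree."""
    groups = partition_groups(views)
    if len(groups) <= 1:
        return []  # every node reports the same membership: nothing to do
    keep = groups[0]
    return sorted(node for node in views if node not in keep)
```

With two nodes that each see only themselves, this picks one node to restart. Whether restarting that node actually heals the cluster is a separate question, and which side's data should survive is a decision the sketch does not attempt to make.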
All in all, to answer my original question, there is nothing one can do by remsh'ing into
a running rabbit node to restore the cluster except mnesia:stop() and mnesia:start() on
specific nodes, and even then it might not work.
Looking forward to AMQP-level federation and HA...
- Dmitriy