[rabbitmq-discuss] Odd Behavior w/ Restoring Broken Cluster

Mon Jul 8 16:11:03 BST 2013

Hello,

I am using RabbitMQ 3.1.1 on RedHat 6.2.  I noticed some odd behavior when
trying to restore a broken cluster, and I think it may be a bug.  In short,
when I "forget" a node in the cluster and later run "rabbitmqctl reset" on
that forgotten node, it re-adds itself to the cluster.

It's actually more complicated than that, but completely reproducible, so
here are the steps:

*Assuming two nodes in a cluster: rabbit-a and rabbit-b.  *

*[root@rabbit-a ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-a' ...
...done.

*[root@rabbit-b ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-b' ...
...done.

*Now assume we need to start rabbit-a without rabbit-b, which is awkward
because rabbit-b was the last node to go down.  Based on what I've read, we
need to start rabbit-a in node-only mode and then forget rabbit-b.*

*[root@rabbit-a ~]# export RABBITMQ_NODE_ONLY=true*
*[root@rabbit-a ~]# rabbitmq-server &*
[1] 19386
*[root@rabbit-a ~]# rabbitmqctl forget_cluster_node --offline rabbit@rabbit-b*
Removing node 'rabbit@rabbit-b' from cluster ...

=INFO REPORT==== 8-Jul-2013::09:45:34 ===
Removing node 'rabbit@rabbit-b' from cluster

=INFO REPORT==== 8-Jul-2013::09:45:34 ===
    application: mnesia
    exited: stopped
    type: temporary
...done.
*[root@rabbit-a ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-a' ...

=INFO REPORT==== 8-Jul-2013::09:45:48 ===
Halting Erlang VM
Error: {{badmatch,undefined},

[{rabbit_plugins,active,0,[{file,"src/rabbit_plugins.erl"},{line,48}]},
         {rabbit,app_shutdown_order,0,[{file,"src/rabbit.erl"},{line,476}]},
         {rabbit,stop,0,[{file,"src/rabbit.erl"},{line,380}]},
         {rabbit,stop_and_halt,0,[{file,"src/rabbit.erl"},{line,384}]},

 {rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]}

*Note the error above when it was stopped-- I'm not sure if that is
expected.  Anyway, let's now turn off node-only mode and start the
server again.  The start succeeds, and the cluster status contains
only its own node:*

*[root@rabbit-a ~]# unset RABBITMQ_NODE_ONLY*
*[root@rabbit-a ~]# rabbitmq-server &*
[1] 21349
*[root@rabbit-a ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbit-a' ...
[{nodes,[{disc,['rabbit@rabbit-a']}]},
 {running_nodes,['rabbit@rabbit-a']},
 {partitions,[]}]
...done.

*So far so good.  But let's assume we're ready to bring rabbit-b back
online.  If we try without making any changes, it will fail due to this
error (which I guess is expected):*

{"init terminating in
do_boot",{rabbit,failure_during_boot,{error,{inconsistent_cluster,"Node
'rabbit@rabbit-b' thinks it's clustered with node 'rabbit@rabbit-a', but
'rabbit@rabbit-a' disagrees"}}}}

*OK.  So I guess we need to reset rabbit-b before we can start it again.  I
know we could delete the mnesia directory, but let's not be so brute force
about it.  Let's put it in node-only mode and use rabbitmqctl reset:*

*[root@rabbitmq-b ~]# export RABBITMQ_NODE_ONLY=true*
*[root@rabbitmq-b ~]# rabbitmq-server &*
[1] 13647
*[root@rabbitmq-b ~]# rabbitmqctl reset*
Resetting node 'rabbit@rabbitmq-b' ...

=INFO REPORT==== 8-Jul-2013::09:49:29 ===
Resetting Rabbit

=INFO REPORT==== 8-Jul-2013::09:49:29 ===
    application: mnesia
    exited: stopped
    type: temporary
Error: {version_mismatch,[],
                         [add_ip_to_listener,exchange_decorators,
                          exchange_event_serial,gm,gm_pids,
                          mirrored_supervisor,remove_user_scope,
                          runtime_parameters,semi_durable_route,topic_trie,
                          topic_trie_node,user_admin_to_tags,add_queue_ttl,
                          multiple_routing_keys]}
*[root@rabbitmq-b ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbitmq-b' ...

=INFO REPORT==== 8-Jul-2013::09:50:23 ===
Halting Erlang VM
Error: {{badmatch,undefined},

[{rabbit_plugins,active,0,[{file,"src/rabbit_plugins.erl"},{line,48}]},
         {rabbit,app_shutdown_order,0,[{file,"src/rabbit.erl"},{line,476}]},
         {rabbit,stop,0,[{file,"src/rabbit.erl"},{line,380}]},
         {rabbit,stop_and_halt,0,[{file,"src/rabbit.erl"},{line,384}]},

 {rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]}

*Note again the error when stopping, but also the version_mismatch error
when resetting.  But here is the WEIRD thing.  Now go back to rabbit-a and
get the cluster_status.  It seems that rabbit-b has magically rejoined the
cluster!*

*[root@rabbitmq-a ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbitmq-a' ...
[{nodes,[{disc,['rabbit@rabbitmq-b','rabbit@rabbitmq-a']}]},
 {running_nodes,['rabbit@rabbitmq-a']},
 {partitions,[]}]
...done.
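
For anyone repeating these steps, the membership check above can be scripted rather than read by eye. The following is only a sketch: the function name node_in_cluster is made up for illustration, and it assumes the 3.1.x cluster_status output format shown in this thread, with the whole {nodes,...} entry on a single line and node names written as rabbit@host.

```shell
#!/bin/sh
# Hypothetical helper (not part of RabbitMQ): succeed if the given node name
# is listed in the {nodes,...} entry of `rabbitmqctl cluster_status` output
# read from stdin. Assumes the node list sits on one line, as in this thread.
node_in_cluster() {
    # $1 is the full node name, e.g. rabbit@rabbit-b
    # The first grep keeps only the {nodes,...} line ({running_nodes,...}
    # does not contain the literal text "{nodes,"); the second looks for
    # the quoted node name on that line.
    grep -F '{nodes,' | grep -qF "'$1'"
}

# Example use on rabbit-a after forgetting and then resetting rabbit-b:
#   rabbitmqctl cluster_status | node_in_cluster rabbit@rabbit-b \
#       && echo "rabbit-b is listed as a cluster member again"
```

The check deliberately ignores running_nodes, since the surprise here is that the forgotten node reappears in the member list while it is not even running.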

Sure enough, if we restart rabbit-b, it will be operating in a cluster with
rabbit-a again:

*[root@rabbitmq-b ~]# unset RABBITMQ_NODE_ONLY*
*[root@rabbitmq-b ~]# rabbitmq-server &*
[1] 15775
*[root@rabbitmq-b ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbitmq-b' ...
[{nodes,[{disc,['rabbit@rabbitmq-b','rabbit@rabbitmq-a']}]},
 {running_nodes,['rabbit@vm-rh62-cmoesel','rabbitmq-b']},
 {partitions,[]}]
...done.

So-- this is not at all what I expected.  Seems like a bug, right?  I guess
in this case I will just delete the mnesia directory instead of trying to
do a reset.
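
As a rough sketch of that workaround, the helper below moves the Mnesia directory aside instead of deleting it outright, which is a more cautious variation on what is described above. The function name set_aside_mnesia is made up, the directory path has to be supplied by the caller (the /var/lib/rabbitmq/mnesia path in the usage comment is an assumption about a default package install), and the node must already be stopped.

```shell
#!/bin/sh
# Hypothetical helper (not part of RabbitMQ): rename a stopped node's Mnesia
# directory with a timestamp suffix so the node starts from a blank state,
# while keeping the old data around in case it is needed.
set_aside_mnesia() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    # Print the backup location on success so the caller can record it.
    mv "$dir" "$backup" && echo "$backup"
}

# Example use on rabbit-b, with the broker stopped (path is an assumption):
#   set_aside_mnesia /var/lib/rabbitmq/mnesia
```

Once the node has been rejoined and verified, the backup directory can be removed by hand.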

-Chris