[rabbitmq-discuss] Odd Behavior w/ Restoring Broken Cluster
Chris
stuff at moesel.net
Mon Jul 8 16:11:03 BST 2013
Hello,
I am using RabbitMQ 3.1.1 on RedHat 6.2. I noticed some odd behavior when
trying to restore a broken cluster that I think may be a bug. In short,
when I "forget" a node in the cluster, then later call "rabbitmqctl reset"
on it, it re-adds itself to the cluster.
It's actually more complicated than that, but completely reproducible, so
here are the steps:
*Assuming two nodes in a cluster: rabbit-a and rabbit-b.*
*[root@rabbit-a ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-a' ...
...done.
*[root@rabbit-b ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-b' ...
...done.
*Now we will assume we need to start rabbit-a without rabbit-b, which is
all sorts of fun since rabbit-b was the last one down. Based on what I've
read, we need to start rabbit-a in node-only mode and then forget rabbit-b.*
*[root@rabbit-a ~]# export RABBITMQ_NODE_ONLY=true*
*[root@rabbit-a ~]# rabbitmq-server &*
[1] 19386
*[root@rabbit-a ~]# rabbitmqctl forget_cluster_node --offline rabbit@rabbit-b*
Removing node 'rabbit@rabbit-b' from cluster ...
=INFO REPORT==== 8-Jul-2013::09:45:34 ===
Removing node 'rabbit@rabbit-b' from cluster
=INFO REPORT==== 8-Jul-2013::09:45:34 ===
application: mnesia
exited: stopped
type: temporary
...done.
*[root@rabbit-a ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbit-a' ...
=INFO REPORT==== 8-Jul-2013::09:45:48 ===
Halting Erlang VM
Error: {{badmatch,undefined},
[{rabbit_plugins,active,0,[{file,"src/rabbit_plugins.erl"},{line,48}]},
{rabbit,app_shutdown_order,0,[{file,"src/rabbit.erl"},{line,476}]},
{rabbit,stop,0,[{file,"src/rabbit.erl"},{line,380}]},
{rabbit,stop_and_halt,0,[{file,"src/rabbit.erl"},{line,384}]},
{rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]}
*Note the error above when the node was stopped. I'm not sure whether that
is expected. Anyway, let's now turn off node-only mode and start the
server again. It starts successfully, and the cluster status contains
only its own node:*
*[root@rabbit-a ~]# unset RABBITMQ_NODE_ONLY*
*[root@rabbit-a ~]# rabbitmq-server &*
[1] 21349
*[root@rabbit-a ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbit-a' ...
[{nodes,[{disc,['rabbit@rabbit-a']}]},
{running_nodes,['rabbit@rabbit-a']},
{partitions,[]}]
...done.
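For anyone who wants to script this check instead of reading the output by eye, here is a minimal sketch. It assumes the cluster_status output format shown in this post (3.1.x); the function name node_in_cluster is made up for illustration and is not part of rabbitmqctl.

```shell
#!/bin/sh
# Return 0 if the given node name appears in the {nodes,...} term of
# `rabbitmqctl cluster_status` output read from stdin, 1 otherwise.
# Assumption: the output looks like the 3.1.x format shown in this post.
node_in_cluster() {
    node="$1"
    # Flatten the output to one line, then keep only the text between
    # "{nodes," and "{running_nodes," so that running nodes are not matched.
    tr -d '\n' \
        | sed -n 's/.*{nodes,\(.*\){running_nodes,.*/\1/p' \
        | grep -qF "'$node'"
}
```

It could be used as: rabbitmqctl cluster_status | node_in_cluster rabbit@rabbit-b && echo "rabbit-b is still a cluster member"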
*So far so good. But let's assume we're ready to bring rabbit-b back
online. If we try without making any changes, it will fail due to this
error (which I guess is expected):*
{"init terminating in do_boot",{rabbit,failure_during_boot,{error,{inconsistent_cluster,"Node 'rabbit@rabbit-b' thinks it's clustered with node 'rabbit@rabbit-a', but 'rabbit@rabbit-a' disagrees"}}}}
*OK. So I guess we need to reset rabbit-b before we can start it again. I
know we could delete the mnesia directory, but let's not be so brute force
about it. Let's put it in node-only mode and use rabbitmqctl reset:*
*[root@rabbitmq-b ~]# export RABBITMQ_NODE_ONLY=true*
*[root@rabbitmq-b ~]# rabbitmq-server &*
[1] 13647
*[root@rabbitmq-b ~]# rabbitmqctl reset*
Resetting node 'rabbit@rabbitmq-b' ...
=INFO REPORT==== 8-Jul-2013::09:49:29 ===
Resetting Rabbit
=INFO REPORT==== 8-Jul-2013::09:49:29 ===
application: mnesia
exited: stopped
type: temporary
Error: {version_mismatch,[],
[add_ip_to_listener,exchange_decorators,
exchange_event_serial,gm,gm_pids,
mirrored_supervisor,remove_user_scope,
runtime_parameters,semi_durable_route,topic_trie,
topic_trie_node,user_admin_to_tags,add_queue_ttl,
multiple_routing_keys]}
*[root@rabbitmq-b ~]# rabbitmqctl stop*
Stopping and halting node 'rabbit@rabbitmq-b' ...
=INFO REPORT==== 8-Jul-2013::09:50:23 ===
Halting Erlang VM
Error: {{badmatch,undefined},
[{rabbit_plugins,active,0,[{file,"src/rabbit_plugins.erl"},{line,48}]},
{rabbit,app_shutdown_order,0,[{file,"src/rabbit.erl"},{line,476}]},
{rabbit,stop,0,[{file,"src/rabbit.erl"},{line,380}]},
{rabbit,stop_and_halt,0,[{file,"src/rabbit.erl"},{line,384}]},
{rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]}
*Note again the error when stopping, but also the error when resetting.
But here is the WEIRD thing. Now go back to rabbit-a and get the
cluster_status. It seems that rabbit-b has magically rejoined the cluster!*
*[root@rabbitmq-a ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbitmq-a' ...
[{nodes,[{disc,['rabbit@rabbitmq-b','rabbit@rabbitmq-a']}]},
{running_nodes,['rabbit@rabbitmq-a']},
{partitions,[]}]
...done.
Sure enough, if we restart rabbit-b, it will be operating in a cluster with
rabbit-a again:
*[root@rabbitmq-b ~]# unset RABBITMQ_NODE_ONLY*
*[root@rabbitmq-b ~]# rabbitmq-server &*
[1] 15775
*[root@rabbitmq-b ~]# rabbitmqctl cluster_status*
Cluster status of node 'rabbit@rabbitmq-b' ...
[{nodes,[{disc,['rabbit@rabbitmq-b','rabbit@rabbitmq-a']}]},
{running_nodes,['rabbit@vm-rh62-cmoesel','rabbitmq-b']},
{partitions,[]}]
...done.
So-- this is not at all what I expected. Seems like a bug, right? I guess
in this case I will just delete the mnesia directory instead of trying to
do a reset.
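If I go the delete route, moving the directory aside is probably safer than removing it outright. A rough sketch, assuming the node is already stopped; the helper name set_aside_mnesia is made up, and the directory to pass in is an assumption too (on packaged Linux installs the default is usually /var/lib/rabbitmq/mnesia, unless RABBITMQ_MNESIA_BASE is set, so check your own setup first).

```shell
#!/bin/sh
# Move a stopped node's Mnesia directory aside instead of deleting it,
# so the old state can still be inspected or restored later.
# Prints the backup path on success; returns non-zero on failure.
# Assumption: the caller passes the correct directory for their install.
set_aside_mnesia() {
    dir="${1%/}"    # strip one trailing slash so the backup name is a sibling
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}
```

After that, starting rabbit-b normally should bring it up as a fresh, unclustered node, and the old data is still there under the printed backup path if it turns out to be needed.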
-Chris