[rabbitmq-discuss] Bring cluster up after node crash

Carl Hörberg carl.hoerberg at gmail.com
Tue Mar 19 03:41:21 GMT 2013


I have a 3-node cluster. Nodes 2 and 3 went down due to OOM, but node 1 survived. Clients could still publish new messages, but none were delivered. Node 1 had plenty of memory left, so no memory-based blocking was (or at least should have been) in effect.

I then tried to bring nodes 2 and 3 back online by simply restarting them. This is what happened:

Node 1 floods its log for a while at a rate of 20-100 entries/sec:
=ERROR REPORT==== 18-Mar-2013::07:10:40 ===
Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)

Started node 3.
Its log floods with:
=ERROR REPORT==== 18-Mar-2013::08:23:15 ===
Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
and it is stuck at
"starting exchange, queue and binding recovery ..."

rabbitmqctl status hangs forever on node 1.

Started node 2. It starts fast and says "Broker started" in startup_log, but doesn't list the plugins; "service rabbitmq-server start" never returns, and neither does rabbitmqctl status.

Node 2 then runs out of memory again, without any client connections this time:
=INFO REPORT==== 18-Mar-2013::09:09:35 ===
vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
=WARNING REPORT==== 18-Mar-2013::09:09:35 ===
memory resource limit alarm set on node rabbit@tiger02
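
To put a number on how far over the limit node 2 was, the two figures in the watermark report can be subtracted. A minimal Python sketch, assuming the log line has exactly the format quoted above; the function name and the regex are illustrative and not part of RabbitMQ:

```python
import re

# Pattern assumed from the single log line quoted above; other RabbitMQ
# versions may format this report differently.
WATERMARK_RE = re.compile(r"Memory used:(\d+) allowed:(\d+)")

def parse_watermark(line):
    """Return (used, allowed, overage) in bytes, or None if the line does not match."""
    m = WATERMARK_RE.search(line)
    if m is None:
        return None
    used, allowed = int(m.group(1)), int(m.group(2))
    return used, allowed, used - allowed

line = "vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140"
used, allowed, over = parse_watermark(line)
print(f"used={used} allowed={allowed} over={over} bytes")
```

For the line above, the overage is 305058500 bytes.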

Querying /api/overview on node 1 gives:
{error,{error,{badmatch,false},
[{rabbit_mgmt_wm_overview,version,1},
{rabbit_mgmt_wm_overview,to_json,2},
{webmachine_resource,resource_call,3},
{webmachine_resource,do,3},
{webmachine_decision_core,resource_call,1},
{webmachine_decision_core,decision,1},
{webmachine_decision_core,handle_request,2},
{rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}

Node 3 eventually starts.
Killed node 2 and started it again; it stops at "starting database …".
Nothing in the log or startup_err, CPU usage 0%.
Killed it after 30 min and started it again: same thing.

Node 3 can now output rabbitmqctl status; node 1 still cannot.
Node 1 can't be shut down, so I force-kill it.
With node 1 down, node 2 now gets past "starting database" and starts.
Neither node 2 nor node 3 responds to rabbitmqctl status.
Shutting down node 2, but it doesn't respond; have to do kill -9.
Node 3 still doesn't respond to rabbitmqctl status.
Shutting down node 3; it doesn't respond, so I kill it instead. Now all nodes are down.

Note: when rabbitmqctl status doesn't work, other commands like list_users, cluster_status, etc. still work.
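
One way to check which rabbitmqctl subcommands hang without blocking the shell is to run each one under a timeout. A rough Python sketch: the helper names `probe` and `probe_rabbitmqctl` and the 30-second default are my own assumptions, and only the three subcommands mentioned in this report are used:

```python
import subprocess

def probe(cmd, timeout_s=30.0):
    """Run cmd and classify the outcome as 'ok', 'failed', 'hung' or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"  # executable not found on PATH
    return "ok" if result.returncode == 0 else "failed"

def probe_rabbitmqctl(subcommands=("status", "list_users", "cluster_status"),
                      timeout_s=30.0):
    """Probe each rabbitmqctl subcommand in turn and return {subcommand: outcome}."""
    return {sub: probe(["rabbitmqctl", sub], timeout_s) for sub in subcommands}
```

Calling `probe_rabbitmqctl()` on an affected node should then report `hung` for status and `ok` for the others if the node is in the state described in the note.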

Starting up node 3; its log now gets flooded with:
=ERROR REPORT==== 18-Mar-2013::11:09:04 ===
** Generic server <0.629.0> terminating
** Last message in was {init,<0.182.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}

but it eventually comes online and responds to "rabbitmqctl status".

Starting up node 2; it also reports a lot of:
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.640.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],
undefined,[]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],undefined,[]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.645.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},
{<0.32304.1>,<0.30804.0>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==  
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},{<0.32304.1>,<0.30804.0>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}
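
Since the log is flooded with these reports, it can help to pull out just the queue names they mention. A minimal Python sketch, assuming the text passed in contains only crash reports in the format shown above (the regex matches any queue resource term, and the function name is hypothetical):

```python
import re

# Matches the {resource,<<"VHOST">>,queue,<<"NAME">>} term as it appears in
# the reports above, where the name may be wrapped onto the next line.
RESOURCE_RE = re.compile(r'\{resource,<<"([^"]*)">>,queue,\s*<<"([^"]*)">>\}')

def crashed_queues(log_text):
    """Return the sorted, de-duplicated (vhost, queue) pairs named in log_text."""
    return sorted(set(RESOURCE_RE.findall(log_text)))

sample = '''** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],
undefined,[]},
'''
print(crashed_queues(sample))
```

Each report names its queue twice (once in the server state and once in the termination reason), which is why the result is de-duplicated.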

Node 2 comes online; I can now query rabbitmqctl status.
Starting up node 1; it comes online.
The cluster is now working again, but several durable queues are gone(!)





