[rabbitmq-discuss] Bring cluster up after node crash

Tim Watson tim at rabbitmq.com
Wed Mar 20 16:25:58 GMT 2013


Hi Carl,

What version of RabbitMQ are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation ... of this node' error were fixed in recent(ish) releases.
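
To get a rough idea of how heavy the flood is, a minimal Python sketch along these lines could tally the discard reports in a log. This is not part of RabbitMQ: the function name count_discards is hypothetical, and the regex is inferred only from the two sample lines quoted in this thread, so other Erlang/RabbitMQ versions may word the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the two sample log lines in this thread (assumption:
# the wording is stable across the log; other versions may differ).
DISCARD_RE = re.compile(
    r"Discarding message .* in an old incarnation \((\d+)\) of this node \((\d+)\)"
)

def count_discards(lines):
    """Count discard reports keyed by the two numbers printed in the message:
    (old incarnation, current incarnation)."""
    counts = Counter()
    for line in lines:
        match = DISCARD_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts

# Sample input: the two discard lines from this thread plus one unrelated line.
sample = [
    "Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} "
    "from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)",
    "Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} "
    "from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)",
    "memory resource limit alarm set on node rabbit at tiger02",
]
print(count_discards(sample))
```

In practice the lines would come from the node's log file rather than a hard-coded list; the path to that file depends on the installation.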

Cheers,
Tim 

On 19 Mar 2013, at 03:41, Carl Hörberg wrote:

> I have a 3-node cluster. Nodes 2 and 3 went down due to OOM, but node 1 survived. Clients could push new messages but none were delivered. Node 1 had plenty of memory left, so no memory-based blocking was (or at least shouldn't have been) in effect.
> 
> I then tried to bring nodes 2 and 3 back online by simply restarting them. This is what happened:
> 
> Node 1 floods the logs for a while at a rate of 20-100 entries/sec:
> =ERROR REPORT==== 18-Mar-2013::07:10:40 ===
> Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)
> 
> Start up node 3.
> The log floods with:
> =ERROR REPORT==== 18-Mar-2013::08:23:15 ===
> Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
> and startup is stuck at
> "starting exchange, queue and binding recovery ..."
> 
> rabbitmqctl status hangs forever on node 1.
> 
> Start up node 2: it starts fast and says "Broker started" in startup_log, but doesn't list the plugins. "service rabbitmq-server start" never returns, and rabbitmqctl status never returns either.
> 
> Node 2 then runs out of memory again, this time without any client connections:
> =INFO REPORT==== 18-Mar-2013::09:09:35 ===
> vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
> =WARNING REPORT==== 18-Mar-2013::09:09:35 ===
> memory resource limit alarm set on node rabbit at tiger02
> 
> Querying /api/overview on node 1 gives:
> {error,{error,{badmatch,false},
> [{rabbit_mgmt_wm_overview,version,1},
> {rabbit_mgmt_wm_overview,to_json,2},
> {webmachine_resource,resource_call,3},
> {webmachine_resource,do,3},
> {webmachine_decision_core,resource_call,1},
> {webmachine_decision_core,decision,1},
> {webmachine_decision_core,handle_request,2},
> {rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}
> 
> Node 3 starts eventually.
> I kill node 2 and start it again; it stops at "starting database …".
> Nothing in the log or startup_err, CPU usage 0%.
> I kill it after 30 min and start it again: same thing.
> 
> Node 3 can now output rabbitmqctl status; node 1 still cannot.
> Node 1 can't be shut down, so I force-kill it.
> With node 1 down, node 2 now gets past "starting database" and starts.
> Neither node 2 nor node 3 responds to rabbitmqctl status.
> I shut down node 2, but it doesn't respond, so I have to use kill -9.
> Node 3 still doesn't respond to rabbitmqctl status.
> I shut down node 3; it doesn't respond, so I kill it instead. Now all nodes are down.
> 
> Note: when rabbitmqctl status doesn't work, other commands such as list_users, cluster_status etc. still work.
> 
> Starting up node 3, the log now gets flooded with:
> =ERROR REPORT==== 18-Mar-2013::11:09:04 ===
> ** Generic server <0.629.0> terminating
> ** Last message in was {init,<0.182.0>}
> ** When Server state == {q,{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.9934227485209703">>},
> true,true,<0.21310.24>,[],<0.629.0>,[],[],
> [{vhost,<<"vhost1">>},
> {name,<<"HA">>},
> {pattern,<<".*">>},
> {definition,[{<<"ha-mode">>,<<"all">>}]},
> {priority,0}],
> [{<6868.7071.0>,<6868.7070.0>},
> {<6867.19845.80>,<6867.19844.80>},
> {<0.21601.24>,<0.21548.24>}]},
> none,false,undefined,undefined,
> {[],[]},
> undefined,undefined,undefined,undefined,
> {state,fine,5000,undefined},
> {0,nil},
> undefined,undefined,undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> undefined,undefined}
> ** Reason for termination ==  
> ** {'module could not be loaded',
> [{undefined,init,
> [{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.9934227485209703">>},
> true,true,<0.21310.24>,[],<0.629.0>,[],[],
> [{vhost,<<"vhost1">>},
> {name,<<"HA">>},
> {pattern,<<".*">>},
> {definition,[{<<"ha-mode">>,<<"all">>}]},
> {priority,0}],
> [{<6868.7071.0>,<6868.7070.0>},
> {<6867.19845.80>,<6867.19844.80>},
> {<0.21601.24>,<0.21548.24>}]},
> true,#Fun<rabbit_amqqueue_process.5.64830354>]},
> {rabbit_amqqueue_process,handle_call,3},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}
> 
> but it comes online eventually and responds to "rabbitmqctl status".
> 
> Starting up node 2, it also reports a lot of:
> =ERROR REPORT==== 18-Mar-2013::11:11:06 ===
> ** Generic server <0.640.0> terminating
> ** Last message in was {init,<0.152.0>}
> ** When Server state == {q,{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.1019297200255096">>},
> true,true,<0.977.11>,[],<0.640.0>,[],[],
> undefined,[]},
> none,false,undefined,undefined,
> {[],[]},
> undefined,undefined,undefined,undefined,
> {state,fine,5000,undefined},
> {0,nil},
> undefined,undefined,undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> undefined,undefined}
> ** Reason for termination ==  
> ** {'module could not be loaded',
> [{undefined,init,
> [{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.1019297200255096">>},
> true,true,<0.977.11>,[],<0.640.0>,[],[],undefined,[]},
> true,#Fun<rabbit_amqqueue_process.5.64830354>]},
> {rabbit_amqqueue_process,handle_call,3},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}
> =ERROR REPORT==== 18-Mar-2013::11:11:06 ===
> ** Generic server <0.645.0> terminating
> ** Last message in was {init,<0.152.0>}
> ** When Server state == {q,{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.8794151877518743">>},
> true,true,<0.30538.0>,[],<0.645.0>,[],[],
> [{vhost,<<"vhost1">>},
> {name,<<"HA">>},
> {pattern,<<".*">>},
> {definition,[{<<"ha-mode">>,<<"all">>}]},
> {priority,0}],
> [{<6872.28270.5>,<6872.28269.5>},
> {<0.32304.1>,<0.30804.0>}]},
> none,false,undefined,undefined,
> {[],[]},
> undefined,undefined,undefined,undefined,
> {state,fine,5000,undefined},
> {0,nil},
> undefined,undefined,undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> undefined,undefined}
> ** Reason for termination ==  
> ** {'module could not be loaded',
> [{undefined,init,
> [{amqqueue,
> {resource,<<"vhost1">>,queue,
> <<"tmp_topic-0.8794151877518743">>},
> true,true,<0.30538.0>,[],<0.645.0>,[],[],
> [{vhost,<<"vhost1">>},
> {name,<<"HA">>},
> {pattern,<<".*">>},
> {definition,[{<<"ha-mode">>,<<"all">>}]},
> {priority,0}],
> [{<6872.28270.5>,<6872.28269.5>},{<0.32304.1>,<0.30804.0>}]},
> true,#Fun<rabbit_amqqueue_process.5.64830354>]},
> {rabbit_amqqueue_process,handle_call,3},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}
> 
> Node 2 comes online and I can now query rabbitmqctl status.
> Starting up node 1: it comes online.
> The cluster is now working again, but several durable queues are gone(!)


