[rabbitmq-discuss] Bring cluster up after node crash

Wed Mar 20 18:03:45 GMT 2013

Hi Carl,

On 20 Mar 2013, at 16:25, Tim Watson wrote:
> What version of rabbit are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation .. of this node' were fixed in recent(ish) releases.
> 

And another couple of questions, if that's OK. Firstly, how did you install RabbitMQ on each of these nodes? It's possible that one or more of the installs is corrupted somehow. Have you made any modifications to the installs? What does the config look like for each of the nodes?

> On 19 Mar 2013, at 03:41, Carl Hörberg wrote:
>> Node1 floods the logs for a while at a rate of 20-100/sec:
>> =ERROR REPORT==== 18-Mar-2013::07:10:40 ===
>> Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)
>> 
>> Start up node 3
>> Floods
>> =ERROR REPORT==== 18-Mar-2013::08:23:15 ===
>> Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
>> and is stuck at  
>> "starting exchange, queue and binding recovery ..."
>> 

This 'old incarnation of ...' stuff indicates that we have a process id for a queue that is no longer valid. In theory, the only way (I can see) for this to happen is if a queue master restarts faster than any of the slaves can detect its death. We have an outstanding bug to look at that, but it may not be relevant since recent releases have included several HA bug fixes. Regardless, that kind of problem ought to present far earlier than the 'stat' request that's failing...

>> Start up node 2, starts fast, says "Broker started" in startup_log, but doesn't list the plugins, "service rabbitmq-server start" never returns and  rabbitmqctl status and  never returns
>> 

That sounds suspicious - are you sure the enabled-plugins file and configuration for that node are intact?

>> node 2 then runs out of memory again, without client connections this time:  
>> =INFO REPORT==== 18-Mar-2013::09:09:35 ===
>> vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
>> =WARNING REPORT==== 18-Mar-2013::09:09:35 ===
>> memory resource limit alarm set on node rabbit at tiger02
>> 

Is this happening whilst node 1 is still stuck? How long does it take (roughly) to reach this state?
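
To put a rough number on "how long", the timestamps in the report headers can be subtracted from one another. The following is a minimal sketch, assuming the header format shown in the quoted logs; the helper names `report_time` and `elapsed_seconds` are made up for illustration, and the start line should be whatever the node's own log shows at startup.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the report headers (e.g. "18-Mar-2013").
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches headers such as "=INFO REPORT==== 18-Mar-2013::09:09:35 ===".
HEADER = re.compile(
    r"=\w+ REPORT==== (\d{1,2})-(\w{3})-(\d{4})::(\d{2}):(\d{2}):(\d{2}) ===")


def report_time(line):
    """Return the timestamp in a report header line, or None if there is none."""
    m = HEADER.search(line)
    if m is None:
        return None
    day, mon, year, hh, mm, ss = m.groups()
    return datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def elapsed_seconds(start_line, end_line):
    """Seconds between two report headers (both must contain a timestamp)."""
    return (report_time(end_line) - report_time(start_line)).total_seconds()
```

For example, passing the node's first report header as `start_line` and the "=INFO REPORT====" header that precedes the watermark message as `end_line` gives the time taken to hit the memory alarm.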

>> Querying /api/overview at node1 gives:
>> {error,{error,{badmatch,false},
>> [{rabbit_mgmt_wm_overview,version,1},
>> {rabbit_mgmt_wm_overview,to_json,2},
>> {webmachine_resource,resource_call,3},
>> {webmachine_resource,do,3},
>> {webmachine_decision_core,resource_call,1},
>> {webmachine_decision_core,decision,1},
>> {webmachine_decision_core,handle_request,2},
>> {rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}
>> 

What version of Erlang are you running? Upgrading to a recent version would be a good idea, both for the bug fixes and because line numbers in exception stack traces would make it easier to identify where things are going wrong.

For that matter, what OS/Platform are you running on? How did you install Erlang?

>> node 3 starts eventually.  
>> kills node 2, starts again, stops at "starting database …"

What do you mean 'kills node 2' exactly? A node will never kill another node. Do you mean that 'you' killed node 2? If so, how did you do this?

>> nothing in the log or startup_err, cpu usage 0%
>> kills after 30min and starts again, same thing.  
>> 

Again, what do you mean 'kills after 30min and starts again' - is this something you're doing? How are you 'killing' these nodes?

>> node 3 can now output rabbitmqctl status, node 1 still cannot.
>> node 1 can't be shutdown, force kills

Right - so at this point you've done something like `kill -9`, right?

>> with node1 down, node 2 now comes pass "starting database" and starts
>> neither node 2 or node 3 responds to rabbitmqctl status

For how long do they not respond? I wonder if it could be that all these 'kill' signals you're issuing have left the mnesia database in an inconsistent state somehow.

>> shutting down node 2, but doesn't respond, have to do kill -9

'shutting down node 2' how - are you issuing `sudo rabbitmqctl stop` to do that?

>> node 3 still doesn't respond to rabbitmqctl status
>> shutdowns node 3, doesnt respond, killing it instead, now all nodes are down.
>> 

The same approach, right?

>> note: When rabbitmqctl status doesnt work other stuff like list_users, cluster_status etc. works.  
>> 

Sounds like a process is stuck somewhere - the status call attempts to list all running erlang applications on the node, with the timeout set to 'infinity'. If an application has got stuck during startup (or shutdown!) that can be one of the symptoms. Again, please tell us which version of rabbit you're running. We've fixed bugs in (relatively) recent releases that presented as supervision trees getting stuck during shutdown/restart, which might (possibly) explain some of this.
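
Because the status call can block indefinitely, it may help to wrap the command in a client-side timeout when checking which commands respond. This is only a sketch: the helper name `run_with_timeout` and the 30 second limit are assumptions, and it relies on nothing beyond `rabbitmqctl` being on the PATH.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout).

    Returns (None, "") if the command has not finished within timeout_s
    seconds; subprocess.run kills the child process in that case.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout


# Hypothetical usage on a broker host:
#   for sub in ("status", "cluster_status", "list_users"):
#       code, _ = run_with_timeout(["rabbitmqctl", sub], 30)
#       print(sub, "timed out" if code is None else "exit code %d" % code)
```

A command that reports "timed out" here while the others return promptly would match the symptom described above, where only the status call hangs.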

>> Starting up node3, log now gets flooded with:
>> =ERROR REPORT==== 18-Mar-2013::11:09:04 ===
>> ** Generic server <0.629.0> terminating
>> ** Last message in was {init,<0.182.0>}
>> ** When Server state == {q,{amqqueue,
[snip]
>> ** Reason for termination ==  
>> ** {'module could not be loaded',
>> [{undefined,init,
[snip]

This error has occurred because the backing queue module for the queue process is set to 'undefined' - have you made any configuration changes, such as setting the name of the backing queue module by any chance?

Please let us know the answers to these queries and we'll try to figure out what's going on.

Cheers,
Tim