[rabbitmq-discuss] 2.8.7 - Unexpected termination of HA node
Tim Watson
tim at rabbitmq.com
Mon Oct 8 12:02:11 BST 2012
Hi Nathanael,
We are aware of a bug regarding a stale master pid, which results from a master node restarting before the slave notices, and we *are* trying to fix this situation! The changes between 2.8.6 and 2.8.7 did not explicitly aim to fix this bug, and we previously saw symptoms that the master would not get restarted and that node would not be mirrored correctly - this sounds exactly like 'the queue [would] vanish from the cluster" which you describe. Clearly one of the 2.8.7 changes has caused a worse symptom to appear when the stale master is detected, so first of all, thanks very much for reporting this!
If you could provide me with logs for the two nodes, from both the master (during startup) and the slave that died, that would enormously helpful. From what I'm reading in your email, the second node actually crashes altogether when it detects that the master dies right? Can you please make sure the logs you post are in files with the node name and/or some indication in the file name whether that log is for the master or the slave? That would be really helpful in replicating and trying to fix this behaviour.
Cheers,
Tim
On 8 Oct 2012, at 01:56, Rensen, Nathanael wrote:
> While testing RabbitMQ version 2.8.7 I've encountered an issue where the slave node in a HA pair terminates when the master is forcefully rebooted.
>
> The environment is Centos 6.3 with RabbitMQ 2.8.7, Erlang R14B, with a durable queue mirrored across a pair of nodes in a HA cluster. The master is rebooted using reboot -nf to simulate an ungraceful failure. When the master comes back up the slave node terminates with error log shown below.
>
> The behaviour when testing with version 2.8.6 was that the queue would vanish from the cluster. With version 2.8.7 the queue remains intact, but the slave node terminates.
>
> I've found that if the rabbitmq-server is not configured to start at boot the slave node eventually promotes itself to master and survives the eventual restart of the original master.
>
> Nathanael
>
> =INFO REPORT==== 5-Oct-2012::17:24:35 ===
> rabbit on node 'rabbit at zg-dev-nr-002' down
>
> =INFO REPORT==== 5-Oct-2012::17:24:35 ===
> Statistics database started.
>
> =INFO REPORT==== 5-Oct-2012::17:24:35 ===
> Mirrored-queue (queue 'zg-ha-test' in vhost '/'): Slave
> <'rabbit at zg-dev-nr-001'.3.252.0> saw deaths of mirrors
> <'rabbit at zg-dev-nr-002'.2.252.0>
>
> =INFO REPORT==== 5-Oct-2012::17:24:35 ===
> Mirrored-queue (queue 'zg-ha-test' in vhost '/'): Promoting slave
> <'rabbit at zg-dev-nr-001'.3.252.0> to master
>
> =ERROR REPORT==== 5-Oct-2012::17:24:35 ===
> ** Generic server <0.253.0> terminating
> ** Last message in was {mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2235,<0.253.0>}}}
> ** When Server state == {state,
> {3,<0.253.0>},
> {{3,<0.253.0>},undefined},
> {{3,<0.253.0>},undefined},
> {resource,<<"/">>,queue,<<"zg-ha-test">>},
> rabbit_mirror_queue_coordinator,
> {6,
> [{{3,<0.253.0>},
> {view_member,
> {3,<0.253.0>},
> [],
> {3,<0.253.0>},
> {3,<0.253.0>}}}]},
> 0,[],
> [<0.336.0>],
> {[],[]},
> [],undefined}
> ** Reason for termination ==
> ** {function_clause,
> [{gm,handle_info,
> [{mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2235,<0.253.0>}}},
> {state,
> {3,<0.253.0>},
> {{3,<0.253.0>},undefined},
> {{3,<0.253.0>},undefined},
> {resource,<<"/">>,queue,<<"zg-ha-test">>},
> rabbit_mirror_queue_coordinator,
> {6,
> [{{3,<0.253.0>},
> {view_member,
> {3,<0.253.0>},
> [],
> {3,<0.253.0>},
> {3,<0.253.0>}}}]},
> 0,[],
> [<0.336.0>],
> {[],[]},
> [],undefined}]},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}
>
> =ERROR REPORT==== 5-Oct-2012::17:24:36 ===
> ** Generic server <0.252.0> terminating
> ** Last message in was {'EXIT',<0.336.0>,
> {function_clause,
> [{gm,handle_info,
> [{mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2235,<0.253.0>}}},
> {state,
> {3,<0.253.0>},
> {{3,<0.253.0>},undefined},
> {{3,<0.253.0>},undefined},
> {resource,<<"/">>,queue,<<"zg-ha-test">>},
> rabbit_mirror_queue_coordinator,
> {6,
> [{{3,<0.253.0>},
> {view_member,
> {3,<0.253.0>},
> [],
> {3,<0.253.0>},
> {3,<0.253.0>}}}]},
> 0,[],
> [<0.336.0>],
> {[],[]},
> [],undefined}]},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}}
> ** When Server state == {q,
> {amqqueue,
> {resource,<<"/">>,queue,<<"zg-ha-test">>},
> true,false,none,
> [{<<"x-ha-policy">>,longstr,<<"all">>}],
> <0.252.0>,[],all},
> none,false,rabbit_mirror_queue_master,
> {state,<0.253.0>,<0.336.0>,rabbit_variable_queue,
> {vqstate,
> {0,{[],[]}},
> {0,{[],[]}},
> {delta,undefined,0,undefined},
> {0,{[],[]}},
> [... lots of stuff omitted...]
> {291,
> 29148,
> {0,nil},
> undefined,
> {0,nil},
> {qistate,
>
> "/var/lib/rabbitmq/mnesia/rabbit at zg-dev-nr-001/queues/EVSP28XGBN529KMXDI7YW0GM2",
> {{dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> []},
> undefined,0,262144,
> #Fun<rabbit_variable_queue.2.121033067>,
> {0,nil}},
> {{client_msstate,msg_store_persistent,
> <<64,124,193,190,179,5,132,252,0,55,122,227,216,
> 123,112,237>>,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,303177,
>
> "/var/lib/rabbitmq/mnesia/rabbit at zg-dev-nr-001/msg_store_persistent"},
> rabbit_msg_store_ets_index,
>
> "/var/lib/rabbitmq/mnesia/rabbit at zg-dev-nr-001/msg_store_persistent",
> <0.246.0>,307274,299080,311371,315468},
> {client_msstate,msg_store_transient,
> <<38,48,43,9,149,2,213,22,77,78,223,24,149,53,175,
> 245>>,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,282692,
>
> "/var/lib/rabbitmq/mnesia/rabbit at zg-dev-nr-001/msg_store_transient"},
> rabbit_msg_store_ets_index,
>
> "/var/lib/rabbitmq/mnesia/rabbit at zg-dev-nr-001/msg_store_transient",
> <0.241.0>,286789,278595,290886,294983}},
> true,0,#Fun<rabbit_mirror_queue_slave.1.40624233>,
> 291,0,infinity,291,0,152,0,291,
> {rates,
> {{1349,429050,591110},0},
> {{1349,429050,591110},0},
> 1076.4632093998348,1076.4632093998348,
> {1349,429051,592073}},
> {0,nil},
> {0,nil},
> {0,nil},
> {0,nil},
> 0,0,
> {rates,
> {{1349,429050,591110},0},
> {{1349,429050,591110},0},
> 1071.7981467372135,1076.4632093998348,
> {1349,429051,592073}}},
> 0,
> {dict,6,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],
> [[<<218,200,61,58,97,50,72,197,144,158,220,215,
> 159,163,253,105>>|
> published]],
> [[<<80,182,167,56,95,139,134,134,192,40,123,239,
> 62,193,49,206>>|
> published],
> [<<159,219,2,228,110,119,150,246,97,74,183,72,
> 254,233,74,33>>|
> published],
> [<<223,108,153,205,69,107,192,83,254,145,181,
> 172,15,61,33,190>>|
> published]],
> [],[],[],[],
> [[<<8,84,74,65,208,81,184,115,74,86,225,237,241,
> 172,148,18>>|
> published]],
> [],[],[],[],[],
> [[<<"õjq$ùøÞS!ý,a»úuÿ">>|published]],
> []}}},
> [],
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {set,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}}},
> {[],[]},
> undefined,undefined,#Ref<0.0.0.1342>,undefined,
> {state,fine,5000,#Ref<0.0.0.1406>},
> {0,nil},
> undefined,undefined,undefined,
> {dict,1,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],
> [[<7527.317.0>|#Ref<0.0.0.732>]],
> [],[],[],[],[],[],[],[],[]}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
> undefined,undefined}
> ** Reason for termination ==
> ** {noproc,{gen_server2,call,[<0.253.0>,info,infinity]}}
> ** In 'terminate' callback with reason ==
> ** {function_clause,
> [{gm,handle_info,
> [{mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2235,<0.253.0>}}},
> {state,
> {3,<0.253.0>},
> {{3,<0.253.0>},undefined},
> {{3,<0.253.0>},undefined},
> {resource,<<"/">>,queue,<<"zg-ha-test">>},
> rabbit_mirror_queue_coordinator,
> {6,
> [{{3,<0.253.0>},
> {view_member,
> {3,<0.253.0>},
> [],
> {3,<0.253.0>},
> {3,<0.253.0>}}}]},
> 0,[],
> [<0.336.0>],
> {[],[]},
> [],undefined}]},
> {gen_server2,handle_msg,2},
> {proc_lib,wake_up,3}]}
>
> =INFO REPORT==== 5-Oct-2012::17:24:36 ===
> rabbit on node 'rabbit at zg-dev-nr-002' up
>
> =ERROR REPORT==== 5-Oct-2012::17:24:36 ===
> ** Generic server <0.292.0> terminating
> ** Last message in was {mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2236,<0.292.0>}}}
> ** When Server state == {state,<0.290.0>,<0.291.0>,rabbit_mgmt_sup,
> [{rabbit_mgmt_db,
> {rabbit_mgmt_db,start_link,[]},
> permanent,4294967295,worker,
> [rabbit_mgmt_db]}]}
> ** Reason for termination ==
> ** {unexpected_info,{mnesia_tm,'rabbit at zg-dev-nr-002',
> {vote_yes,{tid,2236,<0.292.0>}}}}
>
> =INFO REPORT==== 5-Oct-2012::17:24:36 ===
> application: rabbitmq_management
> exited: shutdown
> type: permanent
>
> =INFO REPORT==== 5-Oct-2012::17:24:36 ===
> stopped TCP Listener on [::]:5672
>
>
> ZettaServe Disclaimer: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately if you have received this email by mistake and delete this email from your system. Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. ZettaServe Pty Ltd accepts no liability for any damage caused by any virus transmitted by this email.
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
More information about the rabbitmq-discuss
mailing list