[rabbitmq-discuss] Loss of queues after network partition?
Tim Watson
tim at rabbitmq.com
Mon Oct 15 11:22:52 BST 2012
Hi Matt,
This is a problem we're aware of (i.e., we have an open bug which we're
working on!) and are looking at ways of handling it. What is basically
happening here is that mnesia gets into an inconsistent state due to a
network partition and the mnesia transaction manager sends unanticipated
(and unhandled) messages to certain Erlang processes causing a crash.
Whilst this falls into the category of "Rabbit clusters do not handle
network partitions" obviously we want to prevent this sort of thing
happening and as I say, we're looking at various ways of responding in a
somewhat more user friendly manner.
In the meanwhile, I'd be very interested to see if the queue information
has actually been deleted from mnesia when this happens. Are these HA
queues durable and do they all "come back to life" after restarting the
node(s)?
Cheers,
Tim
On 10/12/2012 09:01 PM, Matt Pietrek wrote:
> I'll be the first to admit that this is a somewhat unusual/complicated
> circumstance, but I'm posting this on the chance there's something
> unusual. This is RabbitMQ is *2.8.7 *on Erlang R15B01.
>
> We have two brokers (mq1, mq2) running clustered with all HA queues. A
> VIP is in front of them.
>
> At some point there was a network event, and afterwards, a queue
> vanished. (Still trying to figure out exactly what the network hiccup
> was, but it's fair to say there was a communication disruption between
> the nodes.) My guess from looking at the logs is that each node had a
> moment where it couldn't see the other, so took actions that were
> ultimately incompatible when both nodes were visible to each other.
>
> The actual error as seen by our code is a missing queue named
> "charon". You'll see it near the end of the rabbit at mq2.log (below):
>
> *{amqp_error,not_found,"no queue 'charon' in vhost '/'",'queue.declare'}*
>
> Prior to the network event, our client was successfully reading from
> the charon queue.
>
>
> *Here's the relevant parts of the rabbit at mq1.log file
> ====================================================*
> =INFO REPORT==== 11-Oct-2012::14:01:58 ===
> rabbit on node rabbit at mq2 down
>
> =ERROR REPORT==== 11-Oct-2012::14:02:03 ===
> Mnesia(rabbit at mq1): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network, rabbit at mq2}
>
> =ERROR REPORT==== 11-Oct-2012::14:02:03 ===
> ** Generic server <0.547.0> terminating
> ** Last message in was {mnesia_locker,rabbit at mq2,granted}
> ** When Server state == {state,
> {0,<0.547.0>},
> {{0,<0.547.0>},undefined},
> {{0,<0.547.0>},undefined},
> {resource,<<"/">>,queue,<<"charon">>},
> rabbit_mirror_queue_coordinator,
> {3,
> [{{0,<0.547.0>},
> {view_member,
> {0,<0.547.0>},
> [],
> {0,<0.547.0>},
> {0,<0.547.0>}}}]},
> 1,[],
> [<0.546.0>],
> {[],[]},
> [],undefined}
> ** Reason for termination ==
> ** {function_clause,
> [{gm,handle_info,
> [{mnesia_locker,rabbit at mq2,granted},
> {state,
> {0,<0.547.0>},
> {{0,<0.547.0>},undefined},
> {{0,<0.547.0>},undefined},
> {resource,<<"/">>,queue,<<"charon">>},
> rabbit_mirror_queue_coordinator,
> {3,
> [{{0,<0.547.0>},
> {view_member,
> {0,<0.547.0>},
> [],
> {0,<0.547.0>},
> {0,<0.547.0>}}}]},
> 1,[],
> [<0.546.0>],
> {[],[]},
> [],undefined}],
> [{file,"src/gm.erl"},{line,665}]},
>
> {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,912}]},
> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}
>
> =ERROR REPORT==== 11-Oct-2012::14:02:03 ===
> ** Generic server <0.545.0> terminating
> ** Last message in was {'EXIT',<0.546.0>,
> {function_clause,
> [{gm,handle_info,
> [{mnesia_locker,rabbit at mq2,granted},
> {state,
> {0,<0.547.0>},
> {{0,<0.547.0>},undefined},
> {{0,<0.547.0>},undefined},
> {resource,<<"/">>,queue,<<"charon">>},
> rabbit_mirror_queue_coordinator,
> {3,
> [{{0,<0.547.0>},
> {view_member,
> {0,<0.547.0>},
> [],
> {0,<0.547.0>},
> {0,<0.547.0>}}}]},
> 1,[],
> [<0.546.0>],
> {[],[]},
> [],undefined}],
> [{file,"src/gm.erl"},{line,665}]},
> {gen_server2,handle_msg,2,
> [{file,"src/gen_server2.erl"},{line,912}]},
> {proc_lib,wake_up,3,
> [{file,"proc_lib.erl"},{line,237}]}]}}
> ** When Server state == {q,
> {amqqueue,
> {resource,<<"/">>,queue,<<"charon">>},
> true,false,none,
> [{<<"x-ha-policy">>,longstr,<<"all">>}],
> <0.545.0>,[],all},
> none,false,rabbit_mirror_queue_master,
> {state,<0.547.0>,<0.546.0>,rabbit_variable_queue,
> {vqstate,
> {0,{[],[]}},
> {0,{[],[]}},
> {delta,undefined,0,undefined},
> {0,{[],[]}},
> {0,{[],[]}},
> 0,
> {0,nil},
> undefined,
> {0,nil},
> {qistate,
>
> "/highland/var/lib/rabbit at mq1/queues/4L2SY4TVRPM8HP85JDVO85BPL",
> {{dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> []},
> undefined,0,262144,
> #Fun<rabbit_variable_queue.2.121500735>,
> {0,nil}},
> {{client_msstate,msg_store_persistent,
> <<112,32,84,92,54,73,67,43,231,35,164,79,106,8,27,
> 132>>,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,237637,
>
> "/highland/var/lib/rabbit at mq1/msg_store_persistent"},
> rabbit_msg_store_ets_index,
>
> "/highland/var/lib/rabbit at mq1/msg_store_persistent",
> <0.289.0>,241734,233540,245831,249928},
> {client_msstate,msg_store_transient,
> <<204,18,155,40,29,120,124,76,144,83,195,135,250,
> 91,216,3>>,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,217152,
>
> "/highland/var/lib/rabbit at mq1/msg_store_transient"},
> rabbit_msg_store_ets_index,
>
> "/highland/var/lib/rabbit at mq1/msg_store_transient",
> <0.284.0>,221249,213055,225346,229443}},
>
> true,0,#Fun<rabbit_amqqueue_process.5.24780734>,0,
> 0,infinity,0,0,0,0,0,
> {rates,
> {{1349,988602,967125},0},
> {{1349,988602,967125},0},
> 0.0,0.0,
> {1349,988606,141019}},
> {0,nil},
> {0,nil},
> {0,nil},
> {0,nil},
> 0,0,
> {rates,
> {{1349,988602,967125},0},
> {{1349,988602,967125},0},
> 0.0,0.0,
> {1349,988606,141019}}},
> 0,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> [],
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {set,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}}},
> {[],[]},
> undefined,undefined,undefined,undefined,
> {state,fine,5000,undefined},
> {0,nil},
> undefined,undefined,undefined,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
> undefined,undefined}
> ** Reason for termination ==
> ** {noproc,{gen_server2,call,[<0.547.0>,info,infinity]}}
> ** In 'terminate' callback with reason ==
> ** {function_clause,
> [{gm,handle_info,
> [{mnesia_locker,rabbit at mq2,granted},
> {state,
> {0,<0.547.0>},
> {{0,<0.547.0>},undefined},
> {{0,<0.547.0>},undefined},
> {resource,<<"/">>,queue,<<"charon">>},
> rabbit_mirror_queue_coordinator,
> {3,
> [{{0,<0.547.0>},
> {view_member,
> {0,<0.547.0>},
> [],
> {0,<0.547.0>},
> {0,<0.547.0>}}}]},
> 1,[],
> [<0.546.0>],
> {[],[]},
> [],undefined}],
> [{file,"src/gm.erl"},{line,665}]},
>
> {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,912}]},
> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}
>
> =INFO REPORT==== 11-Oct-2012::14:02:03 ===
> Mirrored-queue (queue 'cmcmd' in vhost '/'): Master
> <rabbit at mq1.2.665.0> saw deaths of mirrors <rabbit at mq2.1.383.0>
>
> =INFO REPORT==== 11-Oct-2012::14:02:03 ===
> Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Master
> <rabbit at mq1.2.605.0> saw deaths of mirrors <rabbit at mq2.1.377.0>
>
> =INFO REPORT==== 11-Oct-2012::14:02:03 ===
> global: Name conflict terminating {rabbit_mgmt_db,<3198.32532.0>}
>
> =INFO REPORT==== 11-Oct-2012::14:26:40 ===
> accepting AMQP connection <0.6507.1> (10.72.0.50:46742
> <http://10.72.0.50:46742> -> 10.72.0.47:5672 <http://10.72.0.47:5672>)
>
> =INFO REPORT==== 11-Oct-2012::14:26:40 ===
> closing AMQP connection <0.6507.1> (10.72.0.50:46742
> <http://10.72.0.50:46742> -> 10.72.0.47:5672 <http://10.72.0.47:5672>)
>
>
>
>
> *And here's the relevant parts of the rabbit at m2.log file
> ====================================================*
> =ERROR REPORT==== 11-Oct-2012::14:01:37 ===
> ** Node rabbit at mq1 not responding **
> ** Removing (timedout) connection **
>
> =INFO REPORT==== 11-Oct-2012::14:01:37 ===
> rabbit on node rabbit at mq1 down
>
> =INFO REPORT==== 11-Oct-2012::14:01:40 ===
> Statistics database started.
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'cmcmd' in vhost '/'): Slave
> <rabbit at mq2.1.383.0> saw deaths of mirrors <rabbit at mq1.2.665.0>
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'cmcmd' in vhost '/'): Promoting slave
> <rabbit at mq2.1.383.0> to master
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'charon' in vhost '/'): Slave
> <rabbit at mq2.1.371.0> saw deaths of mirrors <rabbit at mq1.2.545.0>
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'charon' in vhost '/'): Promoting slave
> <rabbit at mq2.1.371.0> to master
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Slave
> <rabbit at mq2.1.377.0> saw deaths of mirrors <rabbit at mq1.2.605.0>
>
> =INFO REPORT==== 11-Oct-2012::14:01:43 ===
> Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Promoting
> slave <rabbit at mq2.1.377.0> to master
>
> =ERROR REPORT==== 11-Oct-2012::14:05:03 ===
> Mnesia(rabbit at mq2): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network, rabbit at mq1}
>
> =INFO REPORT==== 11-Oct-2012::14:05:03 ===
> Statistics database started.
>
> =ERROR REPORT==== 11-Oct-2012::14:29:39 ===
> ** Generic server <0.32583.0> terminating
> ** Last message in was {'$gen_cast',{gm_deaths,[<0.5504.1>]}}
> ** When Server state == {state,
> {amqqueue,
> {resource,<<"/">>,queue,<<"charon">>},
> true,false,none,
> [{<<"x-ha-policy">>,longstr,<<"all">>}],
> <0.371.0>,[],all},
> <0.372.0>,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> [],[]}}},
> #Fun<rabbit_mirror_queue_master.1.32244930>,
> #Fun<rabbit_mirror_queue_master.2.32244930>}
> ** Reason for termination ==
> ** {bad_return_value,
> {error,
> {{badmatch,[]},
> [{rabbit_mirror_queue_misc,'-remove_from_queue/2-fun-0-',2,
> [{file,"src/rabbit_mirror_queue_misc.erl"},{line,63}]},
> {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,829}]},
> {mnesia_tm,execute_transaction,5,
> [{file,"mnesia_tm.erl"},{line,809}]},
> {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
> [{file,"src/rabbit_misc.erl"},{line,440}]},
> {worker_pool_worker,handle_call,3,
> [{file,"src/worker_pool_worker.erl"},{line,80}]},
> {gen_server2,handle_msg,2,
> [{file,"src/gen_server2.erl"},{line,879}]},
> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}}}
>
> =ERROR REPORT==== 11-Oct-2012::14:29:39 ===
> ** Generic server <0.371.0> terminating
> ** Last message in was {'EXIT',<0.32583.0>,
> {bad_return_value,
> {error,
> {{badmatch,[]},
> [{rabbit_mirror_queue_misc,
> '-remove_from_queue/2-fun-0-',2,
> [{file,"src/rabbit_mirror_queue_misc.erl"},
> {line,63}]},
> {mnesia_tm,apply_fun,3,
> [{file,"mnesia_tm.erl"},{line,829}]},
> {mnesia_tm,execute_transaction,5,
> [{file,"mnesia_tm.erl"},{line,809}]},
> {rabbit_misc,
> '-execute_mnesia_transaction/1-fun-0-',1,
> [{file,"src/rabbit_misc.erl"},{line,440}]},
> {worker_pool_worker,handle_call,3,
>
> [{file,"src/worker_pool_worker.erl"},{line,80}]},
> {gen_server2,handle_msg,2,
> [{file,"src/gen_server2.erl"},{line,879}]},
> {proc_lib,wake_up,3,
> [{file,"proc_lib.erl"},{line,237}]}]}}}}
> ** When Server state == {q,
> {amqqueue,
> {resource,<<"/">>,queue,<<"charon">>},
> true,false,none,
> [{<<"x-ha-policy">>,longstr,<<"all">>}],
> <0.371.0>,[],all},
> none,false,rabbit_mirror_queue_master,
>
> {state,<0.372.0>,<0.32583.0>,rabbit_variable_queue,
> {vqstate,
> {0,{[],[]}},
> {0,{[],[]}},
> {delta,undefined,0,undefined},
> {0,{[],[]}},
> {0,{[],[]}},
> 0,
> {0,nil},
> undefined,
> {0,nil},
> {qistate,
>
> "/highland/var/lib/rabbit at mq2/queues/4L2SY4TVRPM8HP85JDVO85BPL",
> {{dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> []},
> undefined,0,262144,
> #Fun<rabbit_variable_queue.2.121500735>,
> {0,nil}},
> {{client_msstate,msg_store_persistent,
> <<23,245,132,154,3,96,225,184,108,75,232,224,184,
> 114,108,141>>,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,282693,
>
> "/highland/var/lib/rabbit at mq2/msg_store_persistent"},
> rabbit_msg_store_ets_index,
>
> "/highland/var/lib/rabbit at mq2/msg_store_persistent",
> <0.285.0>,286790,278596,290887,294984},
> {client_msstate,msg_store_transient,
> <<17,62,75,125,126,21,66,37,170,44,198,72,18,94,
> 170,197>>,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {state,262208,
>
> "/highland/var/lib/rabbit at mq2/msg_store_transient"},
> rabbit_msg_store_ets_index,
>
> "/highland/var/lib/rabbit at mq2/msg_store_transient",
> <0.280.0>,266305,258111,270402,274499}},
>
> true,0,#Fun<rabbit_mirror_queue_slave.1.69615989>,
> 0,0,infinity,0,0,0,0,0,
> {rates,
> {{1349,989325,677031},0},
> {{1349,989325,677031},0},
> 0.0,0.0,
> {1349,989505,348090}},
> {0,nil},
> {0,nil},
> {0,nil},
> {0,nil},
> 0,0,
> {rates,
> {{1349,989325,677031},0},
> {{1349,989325,677031},0},
> 0.0,0.0,
> {1349,989505,348090}}},
> 0,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> [],
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}},
> {set,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
> []}}}},
> {[],[]},
> undefined,undefined,undefined,undefined,
> {state,fine,5000,undefined},
> {0,nil},
> undefined,undefined,undefined,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
> 1,
> {{0,nil},{0,nil}},
> undefined,
> {dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
> undefined,undefined}
> ** Reason for termination ==
> ** {noproc,{gen_server2,call,[<0.372.0>,info,infinity]}}
> ** In 'terminate' callback with reason ==
> ** {bad_return_value,
> {error,
> {{badmatch,[]},
> [{rabbit_mirror_queue_misc,'-remove_from_queue/2-fun-0-',2,
> [{file,"src/rabbit_mirror_queue_misc.erl"},{line,63}]},
> {mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,829}]},
> {mnesia_tm,execute_transaction,5,
> [{file,"mnesia_tm.erl"},{line,809}]},
> {rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
> [{file,"src/rabbit_misc.erl"},{line,440}]},
> {worker_pool_worker,handle_call,3,
> [{file,"src/worker_pool_worker.erl"},{line,80}]},
> {gen_server2,handle_msg,2,
> [{file,"src/gen_server2.erl"},{line,879}]},
> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}}}
>
> =INFO REPORT==== 11-Oct-2012::14:29:43 ===
> accepting AMQP connection <0.5524.1> (10.72.0.23:36056
> <http://10.72.0.23:36056> -> 10.72.0.56:5672 <http://10.72.0.56:5672>)
>
> =INFO REPORT==== 11-Oct-2012::14:29:44 ===
> accepting AMQP connection <0.5574.1> (10.72.0.23:36064
> <http://10.72.0.23:36064> -> 10.72.0.56:5672 <http://10.72.0.56:5672>)
>
> =INFO REPORT==== 11-Oct-2012::17:56:16 ===
> accepting AMQP connection <0.23868.1> (10.72.0.53:57122
> <http://10.72.0.53:57122> -> 10.72.0.56:5672 <http://10.72.0.56:5672>)
>
> =ERROR REPORT==== 11-Oct-2012::17:56:16 ===
> connection <0.23868.1>, channel 1 - error:
> {amqp_error,not_found,"no queue 'charon' in vhost '/'",'queue.declare'}
>
> =WARNING REPORT==== 11-Oct-2012::17:56:16 ===
> closing AMQP connection <0.23868.1> (10.72.0.53:57122
> <http://10.72.0.53:57122> -> 10.72.0.56:5672 <http://10.72.0.56:5672>):
> connection_closed_abruptly
>
> =INFO REPORT==== 11-Oct-2012::17:56:16 ===
> accepting AMQP connection <0.23878.1> (10.72.0.52:56025
> <http://10.72.0.52:56025> -> 10.72.0.56:5672 <http://10.72.0.56:5672>)
>
> =ERROR REPORT==== 11-Oct-2012::17:56:16 ===
> connection <0.23878.1>, channel 1 - error:
> {amqp_error,not_found,"no queue 'charon' in vhost '/'",'queue.declare'}
>
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20121015/f61759f4/attachment.htm>
More information about the rabbitmq-discuss
mailing list