[rabbitmq-discuss] Missing Durable Queues on Startup
Chris Larsen
clarsen at euphoriaaudio.com
Tue May 8 01:54:00 BST 2012
Hello, we ran into an odd situation today where RabbitMQ seemed to start
properly but didn't load most of the durable queues from Mnesia. Running
stop_app and then start_app brought back some of the queues, but not all of
them. After we discovered (a few hours later) that not all of the queues had
been restored, running stop_app and start_app once more brought the rest of
them online. Has anyone run into a similar situation?
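For anyone wanting to reproduce the recovery steps, the sequence was just the
standard rabbitmqctl one; the final listing is a sanity check I'd suggest
running after each start_app (not something we did at the time):

    # Restart the RabbitMQ application on the affected node without
    # restarting the Erlang VM:
    rabbitmqctl stop_app
    rabbitmqctl start_app

    # List queues with their durability flag and depth to check that
    # the durable ones came back; /my_app is the vhost from the logs
    # below:
    rabbitmqctl list_queues -p /my_app name durable messages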
Here are some notes about our setup, with a few entries from the logs below.
We have six machines in the cluster, split into pairs running DRBD and
Pacemaker for failover. A glitchy switch caused one of these pairs to
split-brain, and both MQ resources wound up on the same physical host. DRBD
seemed to be fine, and it was only after we resolved the split-brain that we
noticed the missing queues. There weren't any errors in the startup_log or
startup_err files. We're not using HA in Rabbit itself; the queues are just
persistent and durable on each node in the cluster.
We had a number of messages in the SASL logs containing "nodedown", so I
wonder whether the MQ instances simply didn't join the cluster properly the
first couple of times but finally did on the last try. I didn't check the
status of the nodes in the cluster on each node (as suggested elsewhere) in
between restarts, but I'll give that a try if it happens again. Thanks for
your help!
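In case it helps anyone else, the per-node check I plan to run between
restarts next time is along these lines (the expected output shape is my
reading of the clustering docs for this release, so treat it as approximate):

    # On each node, after start_app and before trusting the cluster:
    rabbitmqctl cluster_status

    # Roughly what it should print once all six nodes have rejoined
    # (abridged; node names follow our rabbitN@hostN scheme):
    #
    #   [{nodes,[{disc,['rabbit1@host1','rabbit2@host2',...]}]},
    #    {running_nodes,['rabbit1@host1','rabbit2@host2',...]}]
    #
    # A node missing from running_nodes hasn't actually rejoined, and
    # the durable queues that live on it won't be available.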
RabbitMQ 2.5.1
Erlang R13B03
Ubuntu Server 64-bit, kernel 2.6.38-10
drbd 8.3.7
=ERROR REPORT==== 7-May-2012::10:11:30 ===
Mnesia('rabbit2@host2'): ** ERROR ** Mnesia on 'rabbit2@host2' could not
connect to node(s) ['rabbit1@host1']
=INFO REPORT==== 7-May-2012::10:11:30 ===
Limiting to approx 32668 file handles (29399 sockets)
=INFO REPORT==== 7-May-2012::10:12:46 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 7-May-2012::10:12:46 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=WARNING REPORT==== 7-May-2012::10:12:46 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 7-May-2012::10:12:46 ===
started TCP Listener on 192.168.1.1:5672
=ERROR REPORT==== 7-May-2012::10:13:42 ===
Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got
{inconsistent_database, starting_partitioned_network, 'rabbit1@host3'}
=ERROR REPORT==== 7-May-2012::10:13:42 ===
Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got
{inconsistent_database, starting_partitioned_network, 'rabbit2@host4'}
=SUPERVISOR REPORT==== 7-May-2012::10:13:32 ===
     Supervisor: {<0.11398.2442>,rabbit_channel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.11400.2442>},
                  {name,channel},
                  {mfa,
                      {rabbit_channel,start_link,
                          [1,<0.11368.2442>,<0.11399.2442>,<0.11368.2442>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"my_app">>,true,
                               rabbit_auth_backend_internal,
                               {internal_user,<<"my_app">>,
                                   <<199,64,175,52,127,65,248,9,70,171,15,9,5,
                                     122,73,4,195,147,238,67>>,
                                   true}},
                           <<"/my_app">>,[],<0.11366.2442>,
                           #Fun<rabbit_channel_sup.0.15412730>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]
=CRASH REPORT==== 7-May-2012::10:13:33 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.25562.2442>
    registered_name: []
    exception exit: {{badmatch,
                         {error,
                             [{<7748.8396.531>,
                               {exit,
                                   {nodedown,'rabbit1@host1'},
                                   []}}]}},
                     [{rabbit_channel,terminate,2},
                      {gen_server2,terminate,3},
                      {proc_lib,wake_up,3}]}
      in function gen_server2:terminate/3
    ancestors: [<0.25560.2442>,<0.25544.2442>,<0.25542.2442>,
                rabbit_tcp_client_sup,rabbit_sup,<0.124.0>]
    messages: []
    links: [<0.25560.2442>]
    dictionary: [{{exchange_stats,
                      {resource,<<"/my_app">>,exchange,
                          <<"service.exchange">>}},
                  [{confirm,6},{publish,6}]},
                 {{queue_exchange_stats,
                      {<0.253.0>,
                       {resource,<<"/my_app">>,exchange,
                           <<"data.exchange">>}}},
                  [{confirm,6},{publish,6}]},
                 {delegate,delegate_4},
                 {{monitoring,<0.253.0>},true},
                 {{exchange_stats,
                      {resource,<<"/my_app">>,exchange,
                          <<"data.exchange">>}},
                  [{confirm,6},{publish,6}]},
                 {guid,{{11,<0.25562.2442>},11}}]
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 11357
  neighbours:
=SUPERVISOR REPORT==== 7-May-2012::10:13:33 ===
     Supervisor: {<0.25560.2442>,rabbit_channel_sup}
     Context:    child_terminated
     Reason:     {{badmatch,
                      {error,
                          [{<7748.8396.531>,
                            {exit,{nodedown,'rabbit1@host1'},[]}}]}},
                  [{rabbit_channel,terminate,2},
                   {gen_server2,terminate,3},
                   {proc_lib,wake_up,3}]}
     Offender:   [{pid,<0.25562.2442>},
                  {name,channel},
                  {mfa,
                      {rabbit_channel,start_link,
                          [1,<0.25545.2442>,<0.25561.2442>,<0.25545.2442>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"my_app">>,true,
                               rabbit_auth_backend_internal,
                               {internal_user,<<"my_app">>,
                                   <<199,64,175,52,127,65,248,9,70,171,15,9,5,
                                     122,73,4,195,147,238,67>>,
                                   true}},
                           <<"/my_app">>,[],<0.25543.2442>,
                           #Fun<rabbit_channel_sup.0.15412730>]}},
                  {restart_type,intrinsic},
                  {shutdown,4294967295},
                  {child_type,worker}]