<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><meta name="Generator" content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style></head><body lang="EN-US" link="blue" vlink="purple"><div class="WordSection1"><p class="MsoNormal">Hello, we ran into an odd situation today: RabbitMQ appeared to start properly, but it didn't load most of the durable queues from Mnesia. Running stop_app and then start_app brought back some of the queues, but not all of them. Once we discovered (a few hours later) that not every queue had been restored, running stop_app and start_app once more brought the rest of the queues online. Has anyone run into a similar situation?</p>
<p class="MsoNormal"> </p><p class="MsoNormal">Here are some notes about our setup, with a few relevant log entries below. We have six machines in the cluster, split into pairs running DRBD and Pacemaker for failover. A glitchy switch caused one of these pairs to split-brain, and both MQ resources wound up on the same physical host. DRBD seemed to be fine, and it was after we resolved the split-brain that we noticed the missing queues. There weren't any errors in the startup_log or startup_err files. We're not using HA in Rabbit itself; the queues are just persistent and durable on each node in the cluster.</p>
<p class="MsoNormal"> </p><p class="MsoNormal">We had a number of messages in the SASL logs with "nodedown", so I wonder whether the MQ instances simply didn't join the cluster properly the first couple of times but finally did on the last try. I didn't check the cluster status on each node (as suggested elsewhere) between restarts, but I'll give that a try if it happens again. Thanks for your help!</p>
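<p class="MsoNormal"> </p><p class="MsoNormal">For what it's worth, this is roughly what we plan to capture on each node between the stop_app/start_app attempts next time — just a sketch using standard rabbitmqctl subcommands (the /my_app vhost is ours; adjust to taste):</p><p class="MsoNormal"> </p><p class="MsoNormal">

```shell
#!/bin/sh
# Run on each node in the cluster, before and after start_app.

# Which nodes does this node think belong to the cluster,
# and which of them are actually running right now?
rabbitmqctl cluster_status

# Is the rabbit application itself up on this node?
rabbitmqctl status

# How many durable queues did this node actually load?
# Compare the count across nodes and across restarts.
rabbitmqctl list_queues -p /my_app name durable | wc -l
```

</p><p class="MsoNormal">Comparing the cluster_status output across nodes should show whether a node restarted partitioned (i.e. the running-nodes lists disagree) rather than having genuinely lost its queues.</p>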
<p class="MsoNormal"> </p><p class="MsoNormal">RabbitMQ 2.5.1</p><p class="MsoNormal">Erlang R13B03</p><p class="MsoNormal">Ubuntu Server 64-bit, kernel 2.6.38-10</p><p class="MsoNormal">DRBD 8.3.7</p><p class="MsoNormal"> </p><p class="MsoNormal">
=ERROR REPORT==== 7-May-2012::10:11:30 ===</p><p class="MsoNormal">Mnesia('rabbit2@host2'): ** ERROR ** Mnesia on 'rabbit2@host2' could not connect to node(s) ['rabbit1@host1']</p><p class="MsoNormal">
</p><p class="MsoNormal">=INFO REPORT==== 7-May-2012::10:11:30 ===</p><p class="MsoNormal">Limiting to approx 32668 file handles (29399 sockets)</p><p class="MsoNormal"> </p><p class="MsoNormal">=INFO REPORT==== 7-May-2012::10:12:46 ===</p>
<p class="MsoNormal">msg_store_transient: using rabbit_msg_store_ets_index to provide index</p><p class="MsoNormal"> </p><p class="MsoNormal">=INFO REPORT==== 7-May-2012::10:12:46 ===</p><p class="MsoNormal">msg_store_persistent: using rabbit_msg_store_ets_index to provide index</p>
<p class="MsoNormal"> </p><p class="MsoNormal">=WARNING REPORT==== 7-May-2012::10:12:46 ===</p><p class="MsoNormal">msg_store_persistent: rebuilding indices from scratch</p><p class="MsoNormal"> </p><p class="MsoNormal">=INFO REPORT==== 7-May-2012::10:12:46 ===</p>
<p class="MsoNormal">started TCP Listener on 192.168.1.1:5672</p><p class="MsoNormal"> </p><p class="MsoNormal">=ERROR REPORT==== 7-May-2012::10:13:42 ===</p><p class="MsoNormal">Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit1@host3'}</p>
<p class="MsoNormal"> </p><p class="MsoNormal">=ERROR REPORT==== 7-May-2012::10:13:42 ===</p><p class="MsoNormal">Mnesia('rabbit2@host2'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit2@host4'}</p>
<p class="MsoNormal"> </p><p class="MsoNormal">=SUPERVISOR REPORT==== 7-May-2012::10:13:32 ===</p><p class="MsoNormal"> Supervisor: {&lt;0.11398.2442&gt;,rabbit_channel_sup}</p><p class="MsoNormal"> Context: shutdown</p>
<p class="MsoNormal"> Reason: reached_max_restart_intensity</p><p class="MsoNormal"> Offender: [{pid,&lt;0.11400.2442&gt;},</p><p class="MsoNormal"> {name,channel},</p><p class="MsoNormal"> {mfa,</p>
<p class="MsoNormal"> {rabbit_channel,start_link,</p><p class="MsoNormal"> [1,&lt;0.11368.2442&gt;,&lt;0.11399.2442&gt;,&lt;0.11368.2442&gt;,</p><p class="MsoNormal"> rabbit_framing_amqp_0_9_1,</p>
<p class="MsoNormal"> {user,&lt;&lt;"my_app"&gt;&gt;,true,</p><p class="MsoNormal"> rabbit_auth_backend_internal,</p><p class="MsoNormal"> {internal_user,&lt;&lt;"my_app"&gt;&gt;,</p>
<p class="MsoNormal"> &lt;&lt;199,64,175,52,127,65,248,9,70,171,15,9,5,</p><p class="MsoNormal"> 122,73,4,195,147,238,67&gt;&gt;,</p><p class="MsoNormal">
true}},</p><p class="MsoNormal"> &lt;&lt;"/my_app"&gt;&gt;,[],&lt;0.11366.2442&gt;,</p><p class="MsoNormal"> #Fun&lt;rabbit_channel_sup.0.15412730&gt;]}},</p>
<p class="MsoNormal"> {restart_type,intrinsic},</p><p class="MsoNormal"> {shutdown,4294967295},</p><p class="MsoNormal"> {child_type,worker}]</p><p class="MsoNormal"> </p>
<p class="MsoNormal"> </p><p class="MsoNormal">=CRASH REPORT==== 7-May-2012::10:13:33 ===</p><p class="MsoNormal"> crasher:</p><p class="MsoNormal"> initial call: gen:init_it/6</p><p class="MsoNormal"> pid: &lt;0.25562.2442&gt;</p>
<p class="MsoNormal"> registered_name: []</p><p class="MsoNormal"> exception exit: {{badmatch,</p><p class="MsoNormal"> {error,</p><p class="MsoNormal"> [{&lt;7748.8396.531&gt;,</p>
<p class="MsoNormal"> {exit,</p><p class="MsoNormal"> {nodedown,'rabbit1@host1'},</p><p class="MsoNormal"> []}}]}},</p>
<p class="MsoNormal"> [{rabbit_channel,terminate,2},</p><p class="MsoNormal"> {gen_server2,terminate,3},</p><p class="MsoNormal"> {proc_lib,wake_up,3}]}</p><p class="MsoNormal">
in function gen_server2:terminate/3</p><p class="MsoNormal"> ancestors: [&lt;0.25560.2442&gt;,&lt;0.25544.2442&gt;,&lt;0.25542.2442&gt;,</p><p class="MsoNormal"> rabbit_tcp_client_sup,rabbit_sup,&lt;0.124.0&gt;]</p>
<p class="MsoNormal"> messages: []</p><p class="MsoNormal"> links: [&lt;0.25560.2442&gt;]</p><p class="MsoNormal"> dictionary: [{{exchange_stats,</p><p class="MsoNormal"> {resource,&lt;&lt;"/my_app"&gt;&gt;,exchange,</p>
<p class="MsoNormal"> &lt;&lt;"service.exchange"&gt;&gt;}},</p><p class="MsoNormal"> [{confirm,6},{publish,6}]},</p><p class="MsoNormal"> {{queue_exchange_stats,</p>
<p class="MsoNormal"> {&lt;0.253.0&gt;,</p><p class="MsoNormal"> {resource,&lt;&lt;"/my_app"&gt;&gt;,exchange,</p><p class="MsoNormal"> &lt;&lt;"data.exchange"&gt;&gt;}}},</p>
<p class="MsoNormal"> [{confirm,6},{publish,6}]},</p><p class="MsoNormal"> {delegate,delegate_4},</p><p class="MsoNormal"> {{monitoring,&lt;0.253.0&gt;},true},</p><p class="MsoNormal">
{{exchange_stats,</p><p class="MsoNormal"> {resource,&lt;&lt;"/my_app"&gt;&gt;,exchange,</p><p class="MsoNormal"> &lt;&lt;"data.exchange"&gt;&gt;}},</p>
<p class="MsoNormal"> [{confirm,6},{publish,6}]},</p><p class="MsoNormal"> {guid,{{11,&lt;0.25562.2442&gt;},11}}]</p><p class="MsoNormal"> trap_exit: true</p><p class="MsoNormal"> status: running</p>
<p class="MsoNormal"> heap_size: 987</p><p class="MsoNormal"> stack_size: 24</p><p class="MsoNormal"> reductions: 11357</p><p class="MsoNormal"> neighbours:</p><p class="MsoNormal"> </p><p class="MsoNormal">=SUPERVISOR REPORT==== 7-May-2012::10:13:33 ===</p>
<p class="MsoNormal"> Supervisor: {&lt;0.25560.2442&gt;,rabbit_channel_sup}</p><p class="MsoNormal"> Context: child_terminated</p><p class="MsoNormal"> Reason: {{badmatch,</p><p class="MsoNormal"> {error,</p>
<p class="MsoNormal"> [{&lt;7748.8396.531&gt;,</p><p class="MsoNormal"> {exit,{nodedown,'rabbit1@host1'},[]}}]}},</p><p class="MsoNormal"> [{rabbit_channel,terminate,2},</p>
<p class="MsoNormal"> {gen_server2,terminate,3},</p><p class="MsoNormal"> {proc_lib,wake_up,3}]}</p><p class="MsoNormal"> Offender: [{pid,&lt;0.25562.2442&gt;},</p><p class="MsoNormal">
{name,channel},</p><p class="MsoNormal"> {mfa,</p><p class="MsoNormal"> {rabbit_channel,start_link,</p><p class="MsoNormal"> [1,&lt;0.25545.2442&gt;,&lt;0.25561.2442&gt;,&lt;0.25545.2442&gt;,</p>
<p class="MsoNormal"> rabbit_framing_amqp_0_9_1,</p><p class="MsoNormal"> {user,&lt;&lt;"my_app"&gt;&gt;,true,</p><p class="MsoNormal"> rabbit_auth_backend_internal,</p>
<p class="MsoNormal"> {internal_user,&lt;&lt;"my_app"&gt;&gt;,</p><p class="MsoNormal"> &lt;&lt;199,64,175,52,127,65,248,9,70,171,15,9,5,</p><p class="MsoNormal">
122,73,4,195,147,238,67&gt;&gt;,</p><p class="MsoNormal"> true}},</p><p class="MsoNormal"> &lt;&lt;"/my_app"&gt;&gt;,[],&lt;0.25543.2442&gt;,</p>
<p class="MsoNormal"> #Fun&lt;rabbit_channel_sup.0.15412730&gt;]}},</p><p class="MsoNormal"> {restart_type,intrinsic},</p><p class="MsoNormal"> {shutdown,4294967295},</p>
<p class="MsoNormal"> {child_type,worker}]</p><p class="MsoNormal"> </p></div></body></html>