I'm trying to track down a fun one. This is with 2.8.6. (We're in the process of moving these guys to 2.8.7, but want to understand what's happening first.)

We have two nodes, mq1 and mq2. They simultaneously lose communication with each other, breaking the cluster, although each continues to function independently. That is, each one thinks the other is down.

Now the obvious explanation is some sort of network partition. However, in all of our extensive logs, and in poring over all sorts of system data, I don't see any evidence of a network blip. I'm not saying it's impossible, just that it seems pretty unlikely. The only thing of note I can think of is that we were towards the end of an "apt-get update" when this happened.
<br>On mq1, the rabbit@mq1.log file is:<br><br>=ERROR REPORT==== 22-Oct-2012::09:40:28 ===<br>** Node rabbit@mq2 not responding **<br>** Removing (timedout) connection **<br><br>=ERROR REPORT==== 22-Oct-2012::09:40:28 ===<br>
webmachine error: path="/api/overview"<br>{error,<br> {error,function_clause,<br> [{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',<br> [{badrpc,nodedown},rabbit@mq2],<br> []},<br>
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,<br> []},<br> {rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},<br> {rabbit_mgmt_wm_overview,to_json,2,[]},<br>
{webmachine_resource,resource_call,3,[]},<br> {webmachine_resource,do,3,[]},<br> {webmachine_decision_core,resource_call,1,[]},<br> {webmachine_decision_core,decision,1,[]}]}}<br><br>=ERROR REPORT==== 22-Oct-2012::09:40:28 ===<br>
webmachine error: path="/api/overview"<br>{error,<br> {error,function_clause,<br> [{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',<br> [{badrpc,nodedown},rabbit@mq2],<br> []},<br>
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,<br> []},<br> {rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},<br> {rabbit_mgmt_wm_overview,to_json,2,[]},<br>
{webmachine_resource,resource_call,3,[]},<br> {webmachine_resource,do,3,[]},<br> {webmachine_decision_core,resource_call,1,[]},<br> {webmachine_decision_core,decision,1,[]}]}}<br><< several more of these >><br>
<br>The rabbit@mq1-sasl.log has one of these per queue:<br>=CRASH REPORT==== 22-Oct-2012::09:40:28 ===<br> crasher:<br> initial call: gen:init_it/6<br> pid: <0.260.0><br> registered_name: []<br> exception exit: {function_clause,<br>
[{gm,handle_info,<br> [{mnesia_locker,rabbit@mq2,granted},<br> {state,<br> {3,<0.260.0>},<br> {{3,<0.260.0>},undefined},<br>
{{3,<0.260.0>},undefined},<br> {resource,<<"/">>,queue,<<"charon">>},<br> rabbit_mirror_queue_coordinator,<br>
{12,<br> [{{3,<0.260.0>},<br> {view_member,<br> {3,<0.260.0>},<br> [],<br> {3,<0.260.0>},<br>
{3,<0.260.0>}}}]},<br> 972,[],<br> [<0.1073.0>],<br> {[],[]},<br> [],undefined}],<br>
[]},<br> {gen_server2,handle_msg,2,[]},<br> {proc_lib,wake_up,3,<br> [{file,"proc_lib.erl"},{line,237}]}]}<br> in function gen_server2:terminate/3 <br>
ancestors: [<0.259.0>,rabbit_mirror_queue_slave_sup,rabbit_sup,<br> <0.145.0>]<br> messages: []<br> links: [<0.1073.0>]<br> dictionary: [{random_seed,{986,23084,10363}}]<br>
trap_exit: false<br> status: running<br> heap_size: 1597<br> stack_size: 24<br> reductions: 263704<br> neighbours:<br> neighbour: [{pid,<0.1073.0>},<br> {registered_name,[]},<br>
{initial_call,{gen,init_it,<br> ['Argument__1','Argument__2',<br> 'Argument__3','Argument__4',<br>
'Argument__5','Argument__6']}},<br> {current_function,{gen,do_call,4}},<br> {ancestors,[<0.259.0>,rabbit_mirror_queue_slave_sup,<br>
rabbit_sup,<0.145.0>]},<br> {messages,[]},<br> {links,[<0.259.0>,<0.260.0>]},<br> {dictionary,[]},<br> {trap_exit,false},<br>
{status,waiting},<br> {heap_size,377},<br> {stack_size,27},<br> {reductions,23907}]<br><br>Meanwhile, on mq2, the rabbit@mq2.log file is:<br><br>=INFO REPORT==== 22-Oct-2012::09:40:28 ===<br>
rabbit on node rabbit@mq1 down

=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
Mnesia(rabbit@mq2): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@mq1}

=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Slave <rabbit@mq2.2.260.0> saw deaths of mirrors <rabbit@mq1.1.261.0>

=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Promoting slave <rabbit@mq2.2.260.0> to master

=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Slave <rabbit@mq2.2.258.0> saw deaths of mirrors <rabbit@mq1.1.259.0>

=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Promoting slave <rabbit@mq2.2.258.0> to master

=INFO REPORT==== 22-Oct-2012::09:40:28 ===
    application: rabbitmq_management
    exited: shutdown
    type: permanent

and the rabbit@mq2-sasl.log file has:

=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
     Supervisor: {local,rabbit_mgmt_sup}
     Context:    child_terminated
     Reason:     {shutdown,
                 [{killed,
                   {child,undefined,rabbit_mgmt_db,
                       {rabbit_mgmt_db,start_link,[]},
                       permanent,4294967295,worker,
                       [rabbit_mgmt_db]}}]}
     Offender:   [{pid,<0.303.0>},
                 {name,mirroring},
                 {mfa,
                     {mirrored_supervisor,start_internal,
                         [rabbit_mgmt_sup,
                          [{rabbit_mgmt_db,
                            {rabbit_mgmt_db,start_link,[]},
                            permanent,4294967295,worker,
                            [rabbit_mgmt_db]}]]}},
                 {restart_type,permanent},
                 {shutdown,4294967295},
                 {child_type,worker}]

=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
     Supervisor: {local,rabbit_mgmt_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.303.0>},
                 {name,mirroring},
                 {mfa,
                     {mirrored_supervisor,start_internal,
                         [rabbit_mgmt_sup,
                          [{rabbit_mgmt_db,
                            {rabbit_mgmt_db,start_link,[]},
                            permanent,4294967295,worker,
                            [rabbit_mgmt_db]}]]}},
                 {restart_type,permanent},
                 {shutdown,4294967295},
                 {child_type,worker}]