[rabbitmq-discuss] Two nodes in a cluster losing sight of each other?
Matt Pietrek
mpietrek@skytap.com
Wed Oct 24 02:22:10 BST 2012
I'm trying to track down a fun one. This is with 2.8.6. (We're in the
process of moving these guys to 2.8.7, but want to understand what's
happening first.)
We have two nodes, mq1 and mq2. They simultaneously lose communication with
each other, breaking the cluster, although they both continue to function
independently. That is, each one thinks the other is down.
Now the obvious explanation is some sort of network partition. However, in all
of our extensive logs, and after poring over all sorts of system data, I don't
see any evidence of a network blip. I'm not saying it's impossible, just
pretty unlikely. The only thing of note I can think of is that we were
towards the end of an "apt-get update" when this happened.
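Side note, in case it helps anyone trying to catch this in the act: node-down
detection here is driven by the Erlang distribution tick (net_ticktime, 60
seconds by default), so a long enough stall in inter-node traffic can
partition the cluster without a single packet being dropped. Below is a
minimal watcher sketch that polls each node's management API and flags when
the two sides disagree about who is running. It assumes the 2.8.x management
port (55672), guest/guest credentials, and our mq1/mq2 hostnames; all of
those are placeholders to adjust.

#!/usr/bin/env python
# Watcher sketch: ask each node's management API which cluster members
# it sees as running. If mq1 thinks rabbit@mq2 is down while mq2 thinks
# rabbit@mq1 is down, both brokers are up but partitioned.
# Port 55672 and guest/guest are 2.8.x defaults; hostnames are the ones
# from this post.
import time
import requests

HOSTS = ['mq1', 'mq2']

def running_view(host):
    # {node_name: running?} as seen from `host`
    r = requests.get('http://%s:55672/api/nodes' % host,
                     auth=('guest', 'guest'), timeout=5)
    r.raise_for_status()
    return dict((n['name'], n.get('running', False)) for n in r.json())

while True:
    views = {}
    for h in HOSTS:
        try:
            views[h] = running_view(h)
        except requests.RequestException as exc:
            print('%s unreachable: %s' % (h, exc))
    if len(views) == 2 and (not views['mq1'].get('rabbit@mq2') or
                            not views['mq2'].get('rabbit@mq1')):
        print('partitioned view at %s: %r' % (time.ctime(), views))
    time.sleep(10)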
On mq1, the rabbit@mq1.log file is:
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
** Node rabbit@mq2 not responding **
** Removing (timedout) connection **
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
webmachine error: path="/api/overview"
{error,
{error,function_clause,
[{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',
[{badrpc,nodedown},rabbit@mq2],
[]},
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,
[]},
{rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},
{rabbit_mgmt_wm_overview,to_json,2,[]},
{webmachine_resource,resource_call,3,[]},
{webmachine_resource,do,3,[]},
{webmachine_decision_core,resource_call,1,[]},
{webmachine_decision_core,decision,1,[]}]}}
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
webmachine error: path="/api/overview"
{error,
{error,function_clause,
[{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',
[{badrpc,nodedown},rabbit@mq2],
[]},
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,
[]},
{rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},
{rabbit_mgmt_wm_overview,to_json,2,[]},
{webmachine_resource,resource_call,3,[]},
{webmachine_resource,do,3,[]},
{webmachine_decision_core,resource_call,1,[]},
{webmachine_decision_core,decision,1,[]}]}}
<< several more of these >>
The rabbit@mq1-sasl.log has one of these per queue:
=CRASH REPORT==== 22-Oct-2012::09:40:28 ===
crasher:
initial call: gen:init_it/6
pid: <0.260.0>
registered_name: []
exception exit: {function_clause,
[{gm,handle_info,
[{mnesia_locker,rabbit@mq2,granted},
{state,
{3,<0.260.0>},
{{3,<0.260.0>},undefined},
{{3,<0.260.0>},undefined},
{resource,<<"/">>,queue,<<"charon">>},
rabbit_mirror_queue_coordinator,
{12,
[{{3,<0.260.0>},
{view_member,
{3,<0.260.0>},
[],
{3,<0.260.0>},
{3,<0.260.0>}}}]},
972,[],
[<0.1073.0>],
{[],[]},
[],undefined}],
[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,
[{file,"proc_lib.erl"},{line,237}]}]}
in function gen_server2:terminate/3
ancestors: [<0.259.0>,rabbit_mirror_queue_slave_sup,rabbit_sup,
<0.145.0>]
messages: []
links: [<0.1073.0>]
dictionary: [{random_seed,{986,23084,10363}}]
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 263704
neighbours:
neighbour: [{pid,<0.1073.0>},
{registered_name,[]},
{initial_call,{gen,init_it,
['Argument__1','Argument__2',
'Argument__3','Argument__4',
'Argument__5','Argument__6']}},
{current_function,{gen,do_call,4}},
{ancestors,[<0.259.0>,rabbit_mirror_queue_slave_sup,
rabbit_sup,<0.145.0>]},
{messages,[]},
{links,[<0.259.0>,<0.260.0>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,377},
{stack_size,27},
{reductions,23907}]
Meanwhile, on mq2, the rabbit@mq2.log file is:
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
rabbit on node rabbit@mq1 down
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
Mnesia(rabbit@mq2): ** ERROR ** mnesia_event got {inconsistent_database,
running_partitioned_network, rabbit@mq1}
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Slave <rabbit@mq2.2.260.0> saw
deaths of mirrors <rabbit@mq1.1.261.0>
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Promoting slave
<rabbit@mq2.2.260.0> to master
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Slave <rabbit@mq2.2.258.0>
saw deaths of mirrors <rabbit@mq1.1.259.0>
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Promoting slave
<rabbit@mq2.2.258.0> to master
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
application: rabbitmq_management
exited: shutdown
type: permanent
and the rabbit@mq2-sasl.log file has:
=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: child_terminated
Reason: {shutdown,
[{killed,
{child,undefined,rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}}]}
Offender: [{pid,<0.303.0>},
{name,mirroring},
{mfa,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.303.0>},
{name,mirroring},
{mfa,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
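For reference, the promotions above are standard mirrored-queue behaviour:
on 2.8.x a queue like 'charon' or 'cmcmd' is mirrored by passing the
x-ha-policy argument at declare time (policy-based mirroring via
rabbitmqctl set_policy only arrives in 3.0). A minimal pika sketch of such
a declaration follows; the durable flag and the 'all' policy are
illustrative assumptions, not necessarily how these queues are actually
configured.

import pika

# Declare a queue mirrored across all nodes, 2.8.x style: mirroring is
# requested per-queue via the x-ha-policy declare argument.
conn = pika.BlockingConnection(pika.ConnectionParameters(host='mq1'))
ch = conn.channel()
ch.queue_declare(queue='charon',
                 durable=True,                      # assumption
                 arguments={'x-ha-policy': 'all'})  # mirror to all nodes
conn.close()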