[rabbitmq-discuss] Two nodes in a cluster losing sight of each other?
Matt Pietrek
mpietrek@skytap.com
Wed Oct 24 02:22:10 BST 2012
I'm trying to track down a fun one. This is with 2.8.6. (We're in the
process of moving these guys to 2.8.7, but want to understand what's
happening first.)
We have two nodes, mq1 and mq2. They simultaneously lose communication with
each other, breaking the cluster, although they both continue to function
independently. That is, each one thinks the other is down.
Now the obvious explanation is some sort of network partition. However, in all
of our extensive logs, and after poring over all sorts of system data, I don't
see any evidence of a network blip. I'm not saying it's impossible, just
pretty unlikely. The only thing of note I can think of is that we were
towards the end of an "apt-get update" when this happened.
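Side note, in case it helps anyone trying to catch this in the act: node-down
detection here is driven by the Erlang distribution tick (net_ticktime, 60
seconds by default), so a long enough stall in inter-node traffic can
partition the cluster without a single packet being dropped. Below is a
minimal watcher sketch that polls each node's management API and flags when
the two sides disagree about who is running. It assumes the 2.8.x management
port (55672), guest/guest credentials, and our mq1/mq2 hostnames; all of
those are placeholders to adjust.

#!/usr/bin/env python
# Watcher sketch: ask each node's management API which cluster members
# it sees as running. If mq1 thinks rabbit@mq2 is down while mq2 thinks
# rabbit@mq1 is down, both brokers are up but partitioned.
# Port 55672 and guest/guest are 2.8.x defaults; hostnames are the ones
# from this post.
import time
import requests

HOSTS = ['mq1', 'mq2']

def running_view(host):
    # {node_name: running?} as seen from `host`
    r = requests.get('http://%s:55672/api/nodes' % host,
                     auth=('guest', 'guest'), timeout=5)
    r.raise_for_status()
    return dict((n['name'], n.get('running', False)) for n in r.json())

while True:
    views = {}
    for h in HOSTS:
        try:
            views[h] = running_view(h)
        except requests.RequestException as exc:
            print('%s unreachable: %s' % (h, exc))
    if len(views) == 2 and (not views['mq1'].get('rabbit@mq2') or
                            not views['mq2'].get('rabbit@mq1')):
        print('partitioned view at %s: %r' % (time.ctime(), views))
    time.sleep(10)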
On mq1, the rabbit@mq1.log file is:
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
** Node rabbit@mq2 not responding **
** Removing (timedout) connection **
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
webmachine error: path="/api/overview"
{error,
{error,function_clause,
[{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',
[{badrpc,nodedown},rabbit@mq2],
[]},
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,
[]},
{rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},
{rabbit_mgmt_wm_overview,to_json,2,[]},
{webmachine_resource,resource_call,3,[]},
{webmachine_resource,do,3,[]},
{webmachine_decision_core,resource_call,1,[]},
{webmachine_decision_core,decision,1,[]}]}}
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
webmachine error: path="/api/overview"
{error,
{error,function_clause,
[{rabbit_mgmt_wm_overview,'-contexts/1-lc$^0/1-0-',
[{badrpc,nodedown},rabbit@mq2],
[]},
{rabbit_mgmt_wm_overview,'-rabbit_mochiweb_contexts/0-lc$^0/1-0-',1,
[]},
{rabbit_mgmt_wm_overview,rabbit_mochiweb_contexts,0,[]},
{rabbit_mgmt_wm_overview,to_json,2,[]},
{webmachine_resource,resource_call,3,[]},
{webmachine_resource,do,3,[]},
{webmachine_decision_core,resource_call,1,[]},
{webmachine_decision_core,decision,1,[]}]}}
<< several more of these >>
The rabbit@mq1-sasl.log has one of these per queue:
=CRASH REPORT==== 22-Oct-2012::09:40:28 ===
crasher:
initial call: gen:init_it/6
pid: <0.260.0>
registered_name: []
exception exit: {function_clause,
[{gm,handle_info,
[{mnesia_locker,rabbit@mq2,granted},
{state,
{3,<0.260.0>},
{{3,<0.260.0>},undefined},
{{3,<0.260.0>},undefined},
{resource,<<"/">>,queue,<<"charon">>},
rabbit_mirror_queue_coordinator,
{12,
[{{3,<0.260.0>},
{view_member,
{3,<0.260.0>},
[],
{3,<0.260.0>},
{3,<0.260.0>}}}]},
972,[],
[<0.1073.0>],
{[],[]},
[],undefined}],
[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,
[{file,"proc_lib.erl"},{line,237}]}]}
in function gen_server2:terminate/3
ancestors: [<0.259.0>,rabbit_mirror_queue_slave_sup,rabbit_sup,
<0.145.0>]
messages: []
links: [<0.1073.0>]
dictionary: [{random_seed,{986,23084,10363}}]
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 263704
neighbours:
neighbour: [{pid,<0.1073.0>},
{registered_name,[]},
{initial_call,{gen,init_it,
['Argument__1','Argument__2',
'Argument__3','Argument__4',
'Argument__5','Argument__6']}},
{current_function,{gen,do_call,4}},
{ancestors,[<0.259.0>,rabbit_mirror_queue_slave_sup,
rabbit_sup,<0.145.0>]},
{messages,[]},
{links,[<0.259.0>,<0.260.0>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,377},
{stack_size,27},
{reductions,23907}]
Meanwhile, on mq2, the rabbit@mq2.log file is:
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
rabbit on node rabbit@mq1 down
=ERROR REPORT==== 22-Oct-2012::09:40:28 ===
Mnesia(rabbit@mq2): ** ERROR ** mnesia_event got {inconsistent_database,
running_partitioned_network, rabbit@mq1}
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Slave <rabbit@mq2.2.260.0> saw
deaths of mirrors <rabbit@mq1.1.261.0>
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Promoting slave
<rabbit@mq2.2.260.0> to master
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Slave <rabbit@mq2.2.258.0>
saw deaths of mirrors <rabbit@mq1.1.259.0>
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
Mirrored-queue (queue 'charon' in vhost '/'): Promoting slave
<rabbit@mq2.2.258.0> to master
=INFO REPORT==== 22-Oct-2012::09:40:28 ===
application: rabbitmq_management
exited: shutdown
type: permanent
and the rabbit@mq2-sasl.log file has:
=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: child_terminated
Reason: {shutdown,
[{killed,
{child,undefined,rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}}]}
Offender: [{pid,<0.303.0>},
{name,mirroring},
{mfa,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
=SUPERVISOR REPORT==== 22-Oct-2012::09:40:28 ===
Supervisor: {local,rabbit_mgmt_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.303.0>},
{name,mirroring},
{mfa,
{mirrored_supervisor,start_internal,
[rabbit_mgmt_sup,
[{rabbit_mgmt_db,
{rabbit_mgmt_db,start_link,[]},
permanent,4294967295,worker,
[rabbit_mgmt_db]}]]}},
{restart_type,permanent},
{shutdown,4294967295},
{child_type,worker}]
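For reference, the promotions above are standard mirrored-queue behaviour:
on 2.8.x a queue like 'charon' or 'cmcmd' is mirrored by passing the
x-ha-policy argument at declare time (policy-based mirroring via
rabbitmqctl set_policy only arrives in 3.0). A minimal pika sketch of such
a declaration follows; the durable flag and the 'all' policy are
illustrative assumptions, not necessarily how these queues are actually
configured.

import pika

# Declare a queue mirrored across all nodes, 2.8.x style: mirroring is
# requested per-queue via the x-ha-policy declare argument.
conn = pika.BlockingConnection(pika.ConnectionParameters(host='mq1'))
ch = conn.channel()
ch.queue_declare(queue='charon',
                 durable=True,                      # assumption
                 arguments={'x-ha-policy': 'all'})  # mirror to all nodes
conn.close()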