[rabbitmq-discuss] RabbitMQ management down after cluster issue

Wed Feb 5 22:20:32 GMT 2014

Sorry, reposting this since the original was sent from an unregistered
email address.

Johanes

---------- Forwarded message ----------
Date: 6 February 2014 09:17
Subject: RabbitMQ management down after cluster issue
To:
rabbitmq-discuss at lists.rabbitmq.com

Hi all,

We are having a problem with our RabbitMQ cluster and are not sure how to
find the root cause. We have 2 nodes in a cluster, and the management
console on one node (MQ1) went down after a clustering issue/network
partition. (I assume this is what happened, because we ping the management
console from Zabbix every minute.)
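
For reference, a minimal sketch of the kind of check a monitor could run
against the management plugin. The exact Zabbix check is not shown in this
post, so everything here is an assumption: the host name, the credentials,
the default management port 15672, and the use of the /api/overview
endpoint of the management HTTP API. The `opener` parameter is a
hypothetical hook so the function can be exercised without a live broker.

```python
import base64
import urllib.error
import urllib.request


def management_is_up(host, user, password, port=15672, timeout=5.0,
                     opener=urllib.request.urlopen):
    """Return True if the management HTTP API answers 200 on /api/overview.

    Assumptions: default management port 15672 and HTTP basic auth.
    Host and credentials are placeholders supplied by the caller.
    """
    url = "http://%s:%d/api/overview" % (host, port)
    token = base64.b64encode(
        ("%s:%s" % (user, password)).encode("utf-8")
    ).decode("ascii")
    request = urllib.request.Request(
        url, headers={"Authorization": "Basic " + token}
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status:
        # all are treated as "management is down".
        return False
```

Something like `management_is_up("mq1", "monitor", "secret")` run once a
minute would roughly match the monitoring described above.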

Last month we experienced a network partition and the same node's management
plugin went down as well. Since then that node has been reinstalled with the
same RabbitMQ version, and we now use the "autoheal" policy for handling
network partitions.

Looking at both nodes' logs, here is the critical information I could gather:

MQ1
=INFO REPORT==== 5-Feb-2014::11:14:49 ===
rabbit on node rabbit at mq2 down
=INFO REPORT==== 5-Feb-2014::11:14:51 ===
Statistics database started.
=INFO REPORT==== 5-Feb-2014::11:14:51 ===
Mirrored-queue (queue 'email.out.5' in vhost '/'): Slave
<rabbit at mq1.3.10297.0> saw deaths of mirrors <rabbit at mq2.2.289.0>
.......
=ERROR REPORT==== 5-Feb-2014::11:14:52 ===
Mnesia(rabbit at mq1): ** ERROR ** mnesia_event got {inconsistent_database,
running_partitioned_network, rabbit at mq2}
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Autoheal request sent to rabbit at mq1
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Autoheal request received from rabbit at mq1
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
global: Name conflict terminating {rabbit_mgmt_db,<2705.453.0>}
=ERROR REPORT==== 5-Feb-2014::11:14:52 ===
** Generic server <0.10360.0> terminating
** Last message in was {mnesia_locker,rabbit at mq2,granted}
** When Server state == {state,<0.10358.0>,<0.10359.0>,rabbit_mgmt_sup,
                            [{rabbit_mgmt_db,
                                 {rabbit_mgmt_db,start_link,[]},
                                 permanent,4294967295,worker,
                                 [rabbit_mgmt_db]}]}
** Reason for termination ==
** {unexpected_info,{mnesia_locker,rabbit at mq2,granted}}
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Autoheal decision
  * Partitions: [[rabbit at mq1],[rabbit at mq2]]
  * Winner:     rabbit at mq1
  * Losers:     [rabbit at mq2]
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Autoheal: I am the winner, waiting for [rabbit at mq2] to stop
=INFO REPORT==== 5-Feb-2014::11:14:53 ===
rabbit on node rabbit at mq2 down
=INFO REPORT==== 5-Feb-2014::11:14:58 ===
Autoheal: final node has stopped, starting...
=INFO REPORT==== 5-Feb-2014::11:16:14 ===
rabbit on node rabbit at mq2 up

MQ2
=ERROR REPORT==== 5-Feb-2014::11:14:49 ===
** Node rabbit at mq1 not responding **
** Removing (timedout) connection **
=INFO REPORT==== 5-Feb-2014::11:14:49 ===
rabbit on node rabbit at mq1 down
=INFO REPORT==== 5-Feb-2014::11:14:51 ===
Mirrored-queue (queue 'managedAmqpOutboundSms5' in vhost '/'): Master
<rabbit at mq2.2.277.0> saw deaths of mirrors <rabbit at mq1.3.10289.0>
...
=ERROR REPORT==== 5-Feb-2014::11:14:52 ===
Mnesia(rabbit at mq2): ** ERROR ** mnesia_event got {inconsistent_database,
running_partitioned_network, rabbit at mq1}
...
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Statistics database started.
=WARNING REPORT==== 5-Feb-2014::11:14:52 ===
Autoheal: we were selected to restart; winner is rabbit at mq1
=INFO REPORT==== 5-Feb-2014::11:14:52 ===
Stopping RabbitMQ
=INFO REPORT==== 5-Feb-2014::11:14:53 ===
stopped TCP Listener on [::]:5672
=ERROR REPORT==== 5-Feb-2014::11:14:59 ===
Mnesia(rabbit at mq2): ** ERROR ** mnesia_event got {inconsistent_database,
starting_partitioned_network, rabbit at mq1}
=INFO REPORT==== 5-Feb-2014::11:15:13 ===
Starting RabbitMQ 3.1.5 on Erlang R14B04
Copyright (C) 2007-2013 GoPivotal, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/
...

Is there some obvious error in the RabbitMQ logs that I should be looking at
to find out what is happening? The other log files do not seem to provide
meaningful information.
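
To narrow down which entries to read first, here is a rough sketch that
splits a RabbitMQ log into its "=LEVEL REPORT====" entries and keeps the
ones that are errors or mention partition-related keywords. The header
pattern and the keyword list are taken from the excerpts above; which
keywords actually matter is an assumption, and the names `parse_reports`
and `interesting` are made up for this example.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Header format as seen in the excerpts, e.g.
# "=ERROR REPORT==== 5-Feb-2014::11:14:52 ==="
HEADER = re.compile(r"^=(\w+) REPORT==== (\S+) ===\s*$")

# Substrings copied from the excerpts; the selection is an assumption.
MARKERS = ("inconsistent_database", "Name conflict", "terminating", "Autoheal")


class Report(NamedTuple):
    level: str
    timestamp: str
    body: str


def parse_reports(lines: Iterable[str]) -> Iterator[Report]:
    """Group log lines into reports, one per '=LEVEL REPORT====' header."""
    level = timestamp = None
    body: List[str] = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            if level is not None:
                yield Report(level, timestamp, "\n".join(body))
            level, timestamp = match.group(1), match.group(2)
            body = []
        elif level is not None:
            body.append(line.rstrip("\n"))
    if level is not None:
        yield Report(level, timestamp, "\n".join(body))


def interesting(reports: Iterable[Report]) -> List[Report]:
    """Keep ERROR reports and any report whose body contains a marker."""
    return [
        r for r in reports
        if r.level == "ERROR" or any(m in r.body for m in MARKERS)
    ]
```

Running it over each node's log file, for example
`interesting(parse_reports(open(path)))` with `path` pointing at the
node's log, and comparing the timestamps side by side should show the
order of the partition, autoheal and rabbit_mgmt_db events on each node.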

Here's our system setup (in case it helps):
- 2-node cluster on Linode
- Ubuntu 12.04.3 instances
- RabbitMQ 3.1.5
- management plugin enabled
- both nodes communicate over private LAN IP addresses
- both nodes are used by our app servers, which communicate via spring-amqp
(might be unrelated information)

Any hints or help with debugging the issue would be appreciated.

Thanks

Johanes