[rabbitmq-discuss] Autoheal failure

Matt Pietrek mpietrek at skytap.com
Tue Feb 11 17:30:55 GMT 2014


Simon,

Glad you were able to repro mgmt database thing.

As for the prior Autoheal activity, here's a sanitized version of what I've
got by grepping for "Autoheal", with a little bit of extra context (-C 10).
Let me know if you need something else in particular.

On the "winning" side ("goodnessmq1")

---------------------------------

=INFO REPORT==== 2-Feb-2014::07:28:18 ===

closing AMQP connection <0.25083.0> (goodness2mq1.foo.bar.com:36360 ->
goodnessmqvip1.foo.bar.com:5672)


=WARNING REPORT==== 2-Feb-2014::07:28:18 ===

closing AMQP connection <0.19158.0> (goodnessawsd1.foo.bar.com:52133 ->
goodnessmqvip1.foo.bar.com:5672):

connection_closed_abruptly


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal request received from rabbit at goodnessmq2


=ERROR REPORT==== 2-Feb-2014::07:28:20 ===

Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal decision

  * Partitions: [[rabbit at goodnessmq2],[rabbit at goodnessmq1]]

  * Winner:     rabbit at goodnessmq1

  * Losers:     [rabbit at goodnessmq2]


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal request sent to rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal: I am the winner, waiting for [rabbit at goodnessmq2] to stop


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal request received from rabbit at goodnessmq1 when in state
{winner_waiting,


[rabbit at goodnessmq2],


[rabbit at goodnessmq2]}; ignoring


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

accepting AMQP connection <0.25103.0> (goodness1mq1.foo.bar.com:57781 ->
goodnessmqvip1.foo.bar.com:5672)


=ERROR REPORT==== 2-Feb-2014::07:28:20 ===

connection <0.25103.0>, channel 1 - soft error:

{amqp_error,not_found,

            "home node 'rabbit at goodnessmq2' of durable queue 'federation:
skytap -> mq_prod/goodness1' in vhost '/' is down or inaccessible",

--

            "home node 'rabbit at goodnessmq2' of durable queue 'federation:
skytap -> mq_prod/goodness2' in vhost '/' is down or inaccessible",

            'queue.declare'}


=INFO REPORT==== 2-Feb-2014::07:28:24 ===

closing AMQP connection <0.25176.0> (goodness2mq1.foo.bar.com:59858 ->
goodnessmqvip1.foo.bar.com:5672)


=INFO REPORT==== 2-Feb-2014::07:28:25 ===

rabbit on node rabbit at goodnessmq2 down


=INFO REPORT==== 2-Feb-2014::07:28:25 ===

Autoheal: final node has stopped, starting...


=INFO REPORT==== 2-Feb-2014::07:28:25 ===

rabbit on node rabbit at goodnessmq2 up


=INFO REPORT==== 2-Feb-2014::07:28:25 ===

accepting AMQP connection <0.25238.0> (goodnessawsd1.foo.bar.com:52215 ->
goodnessmqvip1.foo.bar.com:5672)


=INFO REPORT==== 2-Feb-2014::07:28:26 ===

accepting AMQP connection <0.25266.0> (goodness1mq1.foo.bar.com:36928 ->
goodnessmqvip1.foo.bar.com:5672)


--

=INFO REPORT==== 2-Feb-2014::08:55:06 ===

rabbit on node rabbit at goodnessmq2 down


=ERROR REPORT==== 2-Feb-2014::08:55:06 ===

Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

closing AMQP connection <0.530.4> (goodnessawsd2.foo.bar.com:37902 ->
goodnessmqvip1.foo.bar.com:5672)


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal request sent to rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal request received from rabbit at goodnessmq2


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

closing AMQP connection <0.777.4> (goodnessawsd2.foo.bar.com:37917 ->
goodnessmqvip1.foo.bar.com:5672)


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Master
<rabbit at goodnessmq1.2.297.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11858.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'accounting' in vhost '/'): Master
<rabbit at goodnessmq1.2.292.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11848.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Master
<rabbit at goodnessmq1.2.289.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11842.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'mqcmd' in vhost '/'): Master
<rabbit at goodnessmq1.2.290.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11844.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal decision

  * Partitions: [[rabbit at goodnessmq1],[rabbit at goodnessmq2]]

  * Winner:     rabbit at goodnessmq1

  * Losers:     [rabbit at goodnessmq2]


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal request received from rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'system_test' in vhost '/'): Master
<rabbit at goodnessmq1.2.298.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11860.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'atropos' in vhost '/'): Master
<rabbit at goodnessmq1.2.293.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11850.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'cmcmd' in vhost '/'): Master
<rabbit at goodnessmq1.2.296.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11856.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'greenbox' in vhost '/'): Master
<rabbit at goodnessmq1.2.295.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11854.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'web-tasks-prod' in vhost '/'): Master
<rabbit at goodnessmq1.2.291.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11846.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'quota_inspector' in vhost '/'): Master
<rabbit at goodnessmq1.2.294.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11852.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal decision

  * Partitions: [[rabbit at goodnessmq1],[rabbit at goodnessmq2]]

  * Winner:     rabbit at goodnessmq1

  * Losers:     [rabbit at goodnessmq2]


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal: I am the winner, waiting for [rabbit at goodnessmq2] to stop


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal: I am the winner, waiting additionally for [rabbit at goodnessmq2] to
stop


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'awsdriver' in vhost '/'): Master
<rabbit at goodnessmq1.2.288.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11840.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

closing AMQP connection <0.998.4> (goodnessawsd2.foo.bar.com:37932 ->
goodnessmqvip1.foo.bar.com:5672)


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

accepting AMQP connection <0.1025.4> (goodnessawsd2.foo.bar.com:37933 ->
goodnessmqvip1.foo.bar.com:5672)


--

{error,unknown_host}


=ERROR REPORT==== 3-Feb-2014::09:48:55 ===

closing AMQP connection <0.25414.14> (goodnesswfe2.foo.bar.com:37330 ->
goodnessmqvip1.foo.bar.com:5672):

{heartbeat_timeout,running}


=ERROR REPORT==== 3-Feb-2014::09:48:56 ===

Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}


=INFO REPORT==== 3-Feb-2014::09:48:56 ===

Autoheal request received from rabbit at goodnessmq2 when in state
{winner_waiting,


[rabbit at goodnessmq2],


[rabbit at goodnessmq2]}; ignoring


=INFO REPORT==== 3-Feb-2014::09:48:56 ===

global: Name conflict terminating {rabbit_mgmt_db,<2783.10073.5>}


=WARNING REPORT==== 3-Feb-2014::09:48:56 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness1mqvip1.foo.bar.com:5672

{error,unknown_host}


--

{error,unknown_host}


=WARNING REPORT==== 3-Feb-2014::13:14:20 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness1mqvip1.foo.bar.com:5672

{error,unknown_host}


=ERROR REPORT==== 3-Feb-2014::13:14:24 ===

Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}


=INFO REPORT==== 3-Feb-2014::13:14:24 ===

Autoheal request received from rabbit at goodnessmq2 when in state
{winner_waiting,


[rabbit at goodnessmq2],


[rabbit at goodnessmq2]}; ignoring


=WARNING REPORT==== 3-Feb-2014::13:14:25 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness2mqvip1.foo.bar.com:5672

{error,unknown_host}


=WARNING REPORT==== 3-Feb-2014::13:14:25 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672

{error,unknown_host}

17:19 PROD highland at goodnessmq1:~/logs/goodnessmq1.foo.bar.com$
---------------------------------

And on the losing side ("goodnessmq2"):

---------------------------------

=WARNING REPORT==== 2-Feb-2014::07:28:08 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672

{error,unknown_host}


=ERROR REPORT==== 2-Feb-2014::07:28:20 ===

Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal request sent to rabbit at goodnessmq1


=WARNING REPORT==== 2-Feb-2014::07:28:20 ===

Autoheal: we were selected to restart; winner is rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

Stopping RabbitMQ


=INFO REPORT==== 2-Feb-2014::07:28:20 ===

stopped TCP Listener on 0.0.0.0:5672


=WARNING REPORT==== 2-Feb-2014::07:28:21 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672

{error,unknown_host}

--

=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Slave
<rabbit at goodnessmq2.1.11858.0> saw deaths of mirrors
<rabbit at goodnessmq1.2.297.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Promoting slave
<rabbit at goodnessmq2.1.11858.0> to master


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Statistics database started.


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal request sent to rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Slave
<rabbit at goodnessmq2.1.11842.0> saw deaths of mirrors
<rabbit at goodnessmq1.2.289.0>


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Promoting slave
<rabbit at goodnessmq2.1.11842.0> to master


=WARNING REPORT==== 2-Feb-2014::08:55:06 ===

Autoheal: we were selected to restart; winner is rabbit at goodnessmq1


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Stopping RabbitMQ


=INFO REPORT==== 2-Feb-2014::08:55:06 ===

Federation exchange 'skytap' in vhost '/' connected to exchange 'skytap' in
vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672


=INFO REPORT==== 2-Feb-2014::08:55:07 ===

Federation exchange 'skytap' in vhost '/' connected to exchange 'skytap' in
vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672


--

{error,unknown_host}


=WARNING REPORT==== 3-Feb-2014::09:48:55 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672

{error,unknown_host}


=ERROR REPORT==== 3-Feb-2014::09:48:56 ===

Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}


=INFO REPORT==== 3-Feb-2014::09:48:56 ===

Autoheal request sent to rabbit at goodnessmq1


=WARNING REPORT==== 3-Feb-2014::09:48:56 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672

{error,unknown_host}


=INFO REPORT==== 3-Feb-2014::09:48:56 ===

Statistics database started.


=WARNING REPORT==== 3-Feb-2014::09:48:58 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672

--

{error,unknown_host}


=WARNING REPORT==== 3-Feb-2014::13:14:24 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672

{error,ehostunreach}


=ERROR REPORT==== 3-Feb-2014::13:14:24 ===

Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}


=INFO REPORT==== 3-Feb-2014::13:14:24 ===

Autoheal request sent to rabbit at goodnessmq1


=WARNING REPORT==== 3-Feb-2014::13:14:25 ===

Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672

{{shutdown,

     {server_initiated_close,404,

         <<"NOT_FOUND - no exchange 'federation: skytap -> mq_prod B' in
vhost '/'">>}},

 {gen_server,call,

     [<0.6585.1>,

      {call,

          {'exchange.bind',0,<<"federation: skytap -> mq_prod B">>,
---------------------------------





On Tue, Feb 11, 2014 at 2:15 AM, Simon MacMullen <simon at rabbitmq.com> wrote:

> On 11/02/14 01:33, Matt Pietrek wrote:
>
>> Recently we started running a two node HA cluster of Rabbit 3.2.2, with
>> autoheal enabled.
>>
>> After a network partition, I noticed that autoheal didn't appear to
>> work, although the logs indicate it was tried. The first time it
>> happened, the UI in both brokers indicated the other broker was missing
>> from the cluster.
>>
>
> So the log indicates that the winning node ignored a request to start
> autohealing because it was already autohealing.
>
> It's possible that there is a problem if a network partition occurs while
> autoheal is already happening. I'll file a bug to look into this, but it
> would help if you can show me any previous logs from this node - I assume
> that earlier (probably not much earlier) in the logs there were some more
> partition warnings and autoheal events?
>
>
>  The second time this happened, the management plugin seemed to not
>> function afterwards. Most of the Web UI was unusable, i.e it wouldn't
>> tell me which nodes were running, what queues were declared, and so forth.
>>
>
> Separately there is an issue where the management database might fail to
> recover after a network partition. I just replicated that yesterday; note
> that it's not connected to autoheal.
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20140211/b740334a/attachment.html>


More information about the rabbitmq-discuss mailing list