[rabbitmq-discuss] Autoheal failure
Matt Pietrek
mpietrek at skytap.com
Tue Feb 11 17:30:55 GMT 2014
Simon,
Glad you were able to repro mgmt database thing.
As for the prior Autoheal activity, here's a sanitized version of what I've
got by grepping for "Autoheal", with a little bit of extra context (-C 10).
Let me know if you need something else in particular.
On the "winning" side ("goodnessmq1")
---------------------------------
=INFO REPORT==== 2-Feb-2014::07:28:18 ===
closing AMQP connection <0.25083.0> (goodness2mq1.foo.bar.com:36360 ->
goodnessmqvip1.foo.bar.com:5672)
=WARNING REPORT==== 2-Feb-2014::07:28:18 ===
closing AMQP connection <0.19158.0> (goodnessawsd1.foo.bar.com:52133 ->
goodnessmqvip1.foo.bar.com:5672):
connection_closed_abruptly
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal request received from rabbit at goodnessmq2
=ERROR REPORT==== 2-Feb-2014::07:28:20 ===
Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal decision
* Partitions: [[rabbit at goodnessmq2],[rabbit at goodnessmq1]]
* Winner: rabbit at goodnessmq1
* Losers: [rabbit at goodnessmq2]
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal request sent to rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal: I am the winner, waiting for [rabbit at goodnessmq2] to stop
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal request received from rabbit at goodnessmq1 when in state
{winner_waiting,
[rabbit at goodnessmq2],
[rabbit at goodnessmq2]}; ignoring
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
accepting AMQP connection <0.25103.0> (goodness1mq1.foo.bar.com:57781 ->
goodnessmqvip1.foo.bar.com:5672)
=ERROR REPORT==== 2-Feb-2014::07:28:20 ===
connection <0.25103.0>, channel 1 - soft error:
{amqp_error,not_found,
"home node 'rabbit at goodnessmq2' of durable queue 'federation:
skytap -> mq_prod/goodness1' in vhost '/' is down or inaccessible",
--
"home node 'rabbit at goodnessmq2' of durable queue 'federation:
skytap -> mq_prod/goodness2' in vhost '/' is down or inaccessible",
'queue.declare'}
=INFO REPORT==== 2-Feb-2014::07:28:24 ===
closing AMQP connection <0.25176.0> (goodness2mq1.foo.bar.com:59858 ->
goodnessmqvip1.foo.bar.com:5672)
=INFO REPORT==== 2-Feb-2014::07:28:25 ===
rabbit on node rabbit at goodnessmq2 down
=INFO REPORT==== 2-Feb-2014::07:28:25 ===
Autoheal: final node has stopped, starting...
=INFO REPORT==== 2-Feb-2014::07:28:25 ===
rabbit on node rabbit at goodnessmq2 up
=INFO REPORT==== 2-Feb-2014::07:28:25 ===
accepting AMQP connection <0.25238.0> (goodnessawsd1.foo.bar.com:52215 ->
goodnessmqvip1.foo.bar.com:5672)
=INFO REPORT==== 2-Feb-2014::07:28:26 ===
accepting AMQP connection <0.25266.0> (goodness1mq1.foo.bar.com:36928 ->
goodnessmqvip1.foo.bar.com:5672)
--
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
rabbit on node rabbit at goodnessmq2 down
=ERROR REPORT==== 2-Feb-2014::08:55:06 ===
Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
closing AMQP connection <0.530.4> (goodnessawsd2.foo.bar.com:37902 ->
goodnessmqvip1.foo.bar.com:5672)
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal request sent to rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal request received from rabbit at goodnessmq2
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
closing AMQP connection <0.777.4> (goodnessawsd2.foo.bar.com:37917 ->
goodnessmqvip1.foo.bar.com:5672)
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Master
<rabbit at goodnessmq1.2.297.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11858.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'accounting' in vhost '/'): Master
<rabbit at goodnessmq1.2.292.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11848.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Master
<rabbit at goodnessmq1.2.289.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11842.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'mqcmd' in vhost '/'): Master
<rabbit at goodnessmq1.2.290.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11844.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal decision
* Partitions: [[rabbit at goodnessmq1],[rabbit at goodnessmq2]]
* Winner: rabbit at goodnessmq1
* Losers: [rabbit at goodnessmq2]
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal request received from rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'system_test' in vhost '/'): Master
<rabbit at goodnessmq1.2.298.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11860.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'atropos' in vhost '/'): Master
<rabbit at goodnessmq1.2.293.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11850.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'cmcmd' in vhost '/'): Master
<rabbit at goodnessmq1.2.296.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11856.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'greenbox' in vhost '/'): Master
<rabbit at goodnessmq1.2.295.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11854.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'web-tasks-prod' in vhost '/'): Master
<rabbit at goodnessmq1.2.291.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11846.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'quota_inspector' in vhost '/'): Master
<rabbit at goodnessmq1.2.294.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11852.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal decision
* Partitions: [[rabbit at goodnessmq1],[rabbit at goodnessmq2]]
* Winner: rabbit at goodnessmq1
* Losers: [rabbit at goodnessmq2]
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal: I am the winner, waiting for [rabbit at goodnessmq2] to stop
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal: I am the winner, waiting additionally for [rabbit at goodnessmq2] to
stop
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'awsdriver' in vhost '/'): Master
<rabbit at goodnessmq1.2.288.0> saw deaths of mirrors
<rabbit at goodnessmq2.1.11840.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
closing AMQP connection <0.998.4> (goodnessawsd2.foo.bar.com:37932 ->
goodnessmqvip1.foo.bar.com:5672)
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
accepting AMQP connection <0.1025.4> (goodnessawsd2.foo.bar.com:37933 ->
goodnessmqvip1.foo.bar.com:5672)
--
{error,unknown_host}
=ERROR REPORT==== 3-Feb-2014::09:48:55 ===
closing AMQP connection <0.25414.14> (goodnesswfe2.foo.bar.com:37330 ->
goodnessmqvip1.foo.bar.com:5672):
{heartbeat_timeout,running}
=ERROR REPORT==== 3-Feb-2014::09:48:56 ===
Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}
=INFO REPORT==== 3-Feb-2014::09:48:56 ===
Autoheal request received from rabbit at goodnessmq2 when in state
{winner_waiting,
[rabbit at goodnessmq2],
[rabbit at goodnessmq2]}; ignoring
=INFO REPORT==== 3-Feb-2014::09:48:56 ===
global: Name conflict terminating {rabbit_mgmt_db,<2783.10073.5>}
=WARNING REPORT==== 3-Feb-2014::09:48:56 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness1mqvip1.foo.bar.com:5672
{error,unknown_host}
--
{error,unknown_host}
=WARNING REPORT==== 3-Feb-2014::13:14:20 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness1mqvip1.foo.bar.com:5672
{error,unknown_host}
=ERROR REPORT==== 3-Feb-2014::13:14:24 ===
Mnesia(rabbit at goodnessmq1): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq2}
=INFO REPORT==== 3-Feb-2014::13:14:24 ===
Autoheal request received from rabbit at goodnessmq2 when in state
{winner_waiting,
[rabbit at goodnessmq2],
[rabbit at goodnessmq2]}; ignoring
=WARNING REPORT==== 3-Feb-2014::13:14:25 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodness2mqvip1.foo.bar.com:5672
{error,unknown_host}
=WARNING REPORT==== 3-Feb-2014::13:14:25 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672
{error,unknown_host}
17:19 PROD highland at goodnessmq1:~/logs/goodnessmq1.foo.bar.com$
---------------------------------
And on the losing side ("goodnessmq2"):
---------------------------------
=WARNING REPORT==== 2-Feb-2014::07:28:08 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672
{error,unknown_host}
=ERROR REPORT==== 2-Feb-2014::07:28:20 ===
Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal request sent to rabbit at goodnessmq1
=WARNING REPORT==== 2-Feb-2014::07:28:20 ===
Autoheal: we were selected to restart; winner is rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
Stopping RabbitMQ
=INFO REPORT==== 2-Feb-2014::07:28:20 ===
stopped TCP Listener on 0.0.0.0:5672
=WARNING REPORT==== 2-Feb-2014::07:28:21 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://sgp1r1mqvip1.foo.bar.com:5672
{error,unknown_host}
--
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Slave
<rabbit at goodnessmq2.1.11858.0> saw deaths of mirrors
<rabbit at goodnessmq1.2.297.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'ConfigurationManager' in vhost '/'): Promoting slave
<rabbit at goodnessmq2.1.11858.0> to master
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Statistics database started.
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal request sent to rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Slave
<rabbit at goodnessmq2.1.11842.0> saw deaths of mirrors
<rabbit at goodnessmq1.2.289.0>
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Mirrored-queue (queue 'awsdriver_cron' in vhost '/'): Promoting slave
<rabbit at goodnessmq2.1.11842.0> to master
=WARNING REPORT==== 2-Feb-2014::08:55:06 ===
Autoheal: we were selected to restart; winner is rabbit at goodnessmq1
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Stopping RabbitMQ
=INFO REPORT==== 2-Feb-2014::08:55:06 ===
Federation exchange 'skytap' in vhost '/' connected to exchange 'skytap' in
vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
=INFO REPORT==== 2-Feb-2014::08:55:07 ===
Federation exchange 'skytap' in vhost '/' connected to exchange 'skytap' in
vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
--
{error,unknown_host}
=WARNING REPORT==== 3-Feb-2014::09:48:55 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
{error,unknown_host}
=ERROR REPORT==== 3-Feb-2014::09:48:56 ===
Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}
=INFO REPORT==== 3-Feb-2014::09:48:56 ===
Autoheal request sent to rabbit at goodnessmq1
=WARNING REPORT==== 3-Feb-2014::09:48:56 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
{error,unknown_host}
=INFO REPORT==== 3-Feb-2014::09:48:56 ===
Statistics database started.
=WARNING REPORT==== 3-Feb-2014::09:48:58 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
--
{error,unknown_host}
=WARNING REPORT==== 3-Feb-2014::13:14:24 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
{error,ehostunreach}
=ERROR REPORT==== 3-Feb-2014::13:14:24 ===
Mnesia(rabbit at goodnessmq2): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, rabbit at goodnessmq1}
=INFO REPORT==== 3-Feb-2014::13:14:24 ===
Autoheal request sent to rabbit at goodnessmq1
=WARNING REPORT==== 3-Feb-2014::13:14:25 ===
Federation exchange 'skytap' in vhost '/' did not connect to exchange
'skytap' in vhost '/' on amqp://goodnessmqvip1.foo.bar.com:5672
{{shutdown,
{server_initiated_close,404,
<<"NOT_FOUND - no exchange 'federation: skytap -> mq_prod B' in
vhost '/'">>}},
{gen_server,call,
[<0.6585.1>,
{call,
{'exchange.bind',0,<<"federation: skytap -> mq_prod B">>,
---------------------------------
On Tue, Feb 11, 2014 at 2:15 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
> On 11/02/14 01:33, Matt Pietrek wrote:
>
>> Recently we started running a two node HA cluster of Rabbit 3.2.2, with
>> autoheal enabled.
>>
>> After a network partition, I noticed that autoheal didn't appear to
>> work, although the logs indicate it was tried. The first time it
>> happened, the UI in both brokers indicated the other broker was missing
>> from the cluster.
>>
>
> So the log indicates that the winning node ignored a request to start
> autohealing because it was already autohealing.
>
> It's possible that there is a problem if a network partition occurs while
> autoheal is already happening. I'll file a bug to look into this, but it
> would help if you can show me any previous logs from this node - I assume
> that earlier (probably not much earlier) in the logs there were some more
> partition warnings and autoheal events?
>
>
> The second time this happened, the management plugin seemed to not
>> function afterwards. Most of the Web UI was unusable, i.e it wouldn't
>> tell me which nodes were running, what queues were declared, and so forth.
>>
>
> Separately there is an issue where the management database might fail to
> recover after a network partition. I just replicated that yesterday; note
> that it's not connected to autoheal.
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20140211/b740334a/attachment.html>
More information about the rabbitmq-discuss
mailing list