[rabbitmq-discuss] Autoheal failure

Tue Feb 11 04:14:02 GMT 2014

On 11 February 2014 13:45, Ron <ron.cordell at gmail.com> wrote:

> We have seen the same behavior but don't have a fix for it. In a 3 node HA
> cluster we sometimes see node 1 out as seen by node 2, node 2 out as seen
> by node 1, and node 3 thinks everything is ok. Pivotal Labs was working
> with us at one point and they didn't have an explanation, either.
>
> That being said we have had numerous issues getting a stable and reliable
> 3 node cluster working on Windows Server 2008R2. We don't see the stability
> issues in our tests with Linux but we won't be running production on Linux
> rabbit nodes for a couple more weeks.
>
> Cheers,
>
> Ron
>
> Sent from my iPad
>
> On Feb 10, 2014, at 5:33 PM, Matt Pietrek <mpietrek at skytap.com> wrote:
>
> Recently we started running a two node HA cluster of Rabbit 3.2.2, with
> autoheal enabled.
>
> After a network partition, I noticed that autoheal didn't appear to work,
> although the logs indicate it was tried. The first time it happened, the UI
> in both brokers indicated the other broker was missing from the cluster.
>
> The second time this happened, the management plugin seemed to not
> function afterwards. Most of the Web UI was unusable, i.e. it wouldn't tell
> me which nodes were running, what queues were declared, and so forth.
>
One thing I learned from a similar discussion on this mailing list: you can
stop and then start the management application to bring the web UI back:

sudo rabbitmqctl eval 'application:stop(rabbitmq_management).'
sudo rabbitmqctl eval 'application:start(rabbitmq_management).'

At least that will give you the management UI back.
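
If you want to script that restart, here is a rough sketch in Python. It only wraps the two rabbitmqctl eval calls mentioned in this message; the function name, the `run` parameter and the `prefix` tuple are my own illustration, and I am assuming rabbitmqctl is on the PATH and sudo is allowed on the node.

```python
import subprocess

# Application name as used in the rabbitmqctl eval commands in this message.
MGMT_APP = "rabbitmq_management"


def restart_management(run=subprocess.run, prefix=("sudo", "rabbitmqctl")):
    """Stop, then start, the management application on the local node.

    `run` and `prefix` are hypothetical knobs for this sketch: `run` lets you
    swap in a fake runner for a dry run, `prefix` lets you drop sudo.
    """
    commands = [
        [*prefix, "eval", f"application:stop({MGMT_APP})."],
        [*prefix, "eval", f"application:start({MGMT_APP})."],
    ]
    for cmd in commands:
        # check=True makes subprocess.run raise if rabbitmqctl exits non-zero.
        run(cmd, check=True)
    return commands
```

Calling `restart_management()` on the affected broker runs the stop and then the start; passing a `run` that just prints the command gives you a dry run first.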

>
>
> I'm wondering if what I'm seeing below is a known issue or rings any bells.
> Also, is there any other log output I should look at to determine
> success/failure?
>
> On the "winning" side, the logs look like this. The "ignoring" part in
> particular is suspicious.
>
> --------
>
> =ERROR REPORT==== 3-Feb-2014::09:48:56 ===
> Mnesia(rabbit@goodnessmq1): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network, rabbit@goodnessmq2}
>
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Autoheal request received from rabbit@goodnessmq2 when in state
> {winner_waiting,[rabbit@goodnessmq2],[rabbit@goodnessmq2]}; ignoring
>
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> global: Name conflict terminating {rabbit_mgmt_db,<2783.10073.5>}
>
> --------
>
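
On the question of what to look for in the logs: I don't know of a definitive success/failure marker, but a small script can at least pull the partition and autoheal lines out of each broker's log so the two sides can be compared. The sketch below is Python and only matches the messages quoted in this thread; the event names and the `scan_log` function are made up for illustration, and other RabbitMQ versions may word these messages differently.

```python
import re

# Patterns copied from the log excerpts quoted in this thread. They are
# assumptions about the wording, not an exhaustive list of autoheal messages.
PATTERNS = {
    "partition_detected": re.compile(
        r"\{inconsistent_database,\s*running_partitioned_network,\s*([^}\s]+)\}"
    ),
    "autoheal_request_sent": re.compile(r"Autoheal request sent to (\S+)"),
    # [^=] keeps the match from running into the next "=INFO REPORT====" header.
    "autoheal_request_ignored": re.compile(
        r"Autoheal request received from (\S+) when in state\s*"
        r"\{[^=]*?\};\s*ignoring"
    ),
}


def scan_log(text):
    """Return (event, node) tuples in the order they appear in the log text."""
    hits = []
    for event, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), event, match.group(1)))
    hits.sort()
    return [(event, node) for _, event, node in hits]
```

Feeding it the contents of each node's log file gives a short timeline per broker. An "autoheal_request_ignored" entry on one side paired with an "autoheal_request_sent" on the other is the situation described above.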
>
> On the "losing" side, the logs look like this:
>
> --------
>
> =ERROR REPORT==== 3-Feb-2014::09:48:56 ===
> Mnesia(rabbit@goodnessmq2): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network, rabbit@goodnessmq1}
>
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Autoheal request sent to rabbit@goodnessmq1
>
> =WARNING REPORT==== 3-Feb-2014::09:48:56 ===
> Federation exchange 'skytap' in vhost '/' did not connect to exchange
> 'skytap' in vhost '/' on amqp://somethingelse.foo.bar.com:5672
> {error,unknown_host}
>
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Statistics database started.
>
> =WARNING REPORT==== 3-Feb-2014::09:48:58 ===
> Federation exchange 'skytap' in vhost '/' did not connect to exchange
> 'skytap' in vhost '/' on amqp://somethingelse.foo.bar.com:5672
> {error,unknown_host}
>
> --------
>
>
I'm not quite sure what's going on there, since we are not using federation,
but have you checked whether the "losing" side can connect to
somethingelse.foo.bar.com on that port? I remember that when I was playing
around with federation and clustering, I had an issue where one of the nodes
couldn't reach the other host, either because the host wasn't listed in
/etc/hosts or because of a firewall.
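
A quick way to tell those two cases apart from the "losing" machine is to try the name lookup and the TCP connection separately. This is only a sketch in Python: `check_endpoint` and its return strings are names I made up, and it uses the operating system's resolver, which may not behave exactly like the resolver the Erlang VM uses.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return 'unresolvable', 'unreachable', or 'ok' for host:port.

    'unresolvable' means the name lookup failed (DNS or /etc/hosts);
    'unreachable' means the name resolved but no TCP connection succeeded,
    which is what a firewall or a stopped broker would look like.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "unresolvable"
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "ok"
        except OSError:
            # Try the next resolved address, if any.
            continue
    return "unreachable"
```

Running `check_endpoint("somethingelse.foo.bar.com", 5672)` on the losing node should show which case applies. Since the log says {error,unknown_host}, I would guess at the lookup rather than the firewall, but that is a guess.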

> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>