[rabbitmq-discuss] Autoheal failure

Ron ron.cordell at gmail.com
Tue Feb 11 02:45:33 GMT 2014


We have seen the same behavior but don't have a fix for it. In a 3-node HA cluster we sometimes see node 1 out as seen by node 2 and node 2 out as seen by node 1, while node 3 thinks everything is OK. Pivotal Labs was working with us at one point, and they didn't have an explanation either.
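
To make that asymmetry concrete, here is a minimal Python sketch of the situation. It assumes each node's view has already been collected by hand (for example from the management UI or `rabbitmqctl cluster_status` on each node). The `views` dictionary, the node names and the `summarize_views` helper are made up for illustration and are not part of RabbitMQ.

```python
def summarize_views(views):
    """Compare per-node views of the cluster.

    views maps a node name to the set of nodes that node reports as out.
    Returns (mutual, unaware): pairs of nodes that each report the other
    as out, and nodes that report no problem while some other node does.
    """
    # Every (observer, missing_node) report across the cluster.
    reported = {(a, b) for a, outs in views.items() for b in outs}
    # Pairs where both sides report the other as out.
    mutual = sorted({tuple(sorted(p)) for p in reported if (p[1], p[0]) in reported})
    # Nodes that see nothing wrong even though a report exists somewhere.
    unaware = sorted(n for n, outs in views.items() if not outs and reported)
    return mutual, unaware


# Hypothetical data matching the case described above.
views = {"node1": {"node2"}, "node2": {"node1"}, "node3": set()}
mutual, unaware = summarize_views(views)
print(mutual, unaware)
```

With the views described above, node 1 and node 2 come back as a mutual pair and node 3 as the node that sees no problem.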

That being said, we have had numerous issues getting a stable and reliable 3-node cluster working on Windows Server 2008 R2. We don't see the stability issues in our tests on Linux, but we won't be running production on Linux Rabbit nodes for a couple more weeks.

Cheers,

Ron

> On Feb 10, 2014, at 5:33 PM, Matt Pietrek <mpietrek at skytap.com> wrote:
> 
> Recently we started running a two-node HA cluster of RabbitMQ 3.2.2, with autoheal enabled.
> 
> After a network partition, I noticed that autoheal didn't appear to work, although the logs indicate it was tried. The first time it happened, the UI in both brokers indicated the other broker was missing from the cluster.
> 
> The second time this happened, the management plugin seemed not to function afterwards. Most of the Web UI was unusable, i.e., it wouldn't tell me which nodes were running, what queues were declared, and so forth.
> 
> 
> I'm wondering if what I'm seeing below is a known issue or rings any bells. Also, is there any other log output I should look at to determine success/failure?
> 
> On the "winning" side, the logs look like this. The "ignoring" part in particular is suspicious.
> 
> --------
> =ERROR REPORT==== 3-Feb-2014::09:48:56 ===
> Mnesia(rabbit@goodnessmq1): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@goodnessmq2}
> 
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Autoheal request received from rabbit@goodnessmq2 when in state {winner_waiting,
>                                                                [rabbit@goodnessmq2],
>                                                                [rabbit@goodnessmq2]}; ignoring
> 
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> global: Name conflict terminating {rabbit_mgmt_db,<2783.10073.5>}
> 
> --------
> 
> 
> 
> On the "losing" side, the logs look like this:
> 
> --------
> 
> =ERROR REPORT==== 3-Feb-2014::09:48:56 ===
> Mnesia(rabbit@goodnessmq2): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@goodnessmq1}
> 
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Autoheal request sent to rabbit@goodnessmq1
> 
> =WARNING REPORT==== 3-Feb-2014::09:48:56 ===
> Federation exchange 'skytap' in vhost '/' did not connect to exchange 'skytap' in vhost '/' on amqp://somethingelse.foo.bar.com:5672
> {error,unknown_host}
> 
> =INFO REPORT==== 3-Feb-2014::09:48:56 ===
> Statistics database started.
> 
> =WARNING REPORT==== 3-Feb-2014::09:48:58 ===
> Federation exchange 'skytap' in vhost '/' did not connect to exchange 'skytap' in vhost '/' on amqp://somethingelse.foo.bar.com:5672
> {error,unknown_host}
> 
> --------
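
For what it's worth, a rough way to compare both nodes' logs for the messages quoted above is a small script along these lines. It is only a sketch: the patterns are copied from these excerpts, the category names and the `summarize` helper are my own invention, and it says nothing about whether autoheal eventually succeeded, since I don't know what a successful run logs.

```python
import re

# Substrings taken from the log excerpts quoted above. The category labels
# are illustrative names, not RabbitMQ terminology.
PATTERNS = {
    "partition_detected": re.compile(r"inconsistent_database,\s*running_partitioned_network"),
    "autoheal_sent": re.compile(r"Autoheal request sent to"),
    "autoheal_received": re.compile(r"Autoheal request received from"),
    "autoheal_ignored": re.compile(r";\s*ignoring"),
}


def summarize(log_text):
    """Count how often each pattern occurs anywhere in the log text."""
    return {name: len(rx.findall(log_text)) for name, rx in PATTERNS.items()}


# A fragment of the "winning" side's log, pasted in as sample input.
SAMPLE = """\
=ERROR REPORT==== 3-Feb-2014::09:48:56 ===
Mnesia(rabbit@goodnessmq1): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@goodnessmq2}

=INFO REPORT==== 3-Feb-2014::09:48:56 ===
Autoheal request received from rabbit@goodnessmq2 when in state {winner_waiting,
    [rabbit@goodnessmq2],
    [rabbit@goodnessmq2]}; ignoring
"""

print(summarize(SAMPLE))
```

Running it over each node's full log file and comparing the counts side by side would at least show whether every request that was sent was also received, and how many were ignored.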