[rabbitmq-discuss] Two nodes in a cluster losing sight of each other?

Tim Watson tim at rabbitmq.com
Wed Oct 24 10:05:49 BST 2012


Hi Matt,

On 24 Oct 2012, at 02:22, Matt Pietrek wrote:

> I'm trying to track down a fun one. This is with 2.8.6. (We're in the process of moving these guys to 2.8.7, but want to understand what's happening first.)
> 
> We have two nodes, mq1 and mq2. They simultaneously lose communication with each other, breaking the cluster, although they still continue to function independently. That is, each one thinks the other is down.
> 
> Now the obvious explanation is some sort of network partition. However, in all of our extensive logs and by poring over all sorts of system data, I don't see any evidence of a network blip. Not saying it's not possible, just pretty unlikely. The only thing of note I can think of is that we were towards the end of an "apt-get update" when this happened.
> 

Whilst it looks like there's been a network partition here (indeed the mnesia running_partitioned_network message is pretty explicit about what it thinks has happened), there could be another explanation. If either node is heavily loaded, the Erlang net kernel may not get a response back from the other node quickly enough, causing the distribution subsystem to treat that node as unreachable (which is indistinguishable from 'down'). If this is what is happening, then you *could* tweak net_ticktime and give it a higher setting, allowing the net kernel more time to see a response from the other node. This is *not* a panacea, however, and it has a cost: every rabbit in your cluster will take longer to notice when another node really does go down. Use with caution!
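
To put rough numbers on that trade-off, here is a small sketch in Python (not part of RabbitMQ or Erlang; the function name detection_window is made up for illustration). It assumes the rule given in the Erlang kernel documentation: with net_ticktime set to T seconds, an unresponsive node is detected somewhere between T - T/4 and T + T/4 seconds, and the default T is 60. The value 120 is only an example, not a recommendation.

```python
# Rough detection window for an unresponsive node, assuming the rule from the
# Erlang kernel documentation: MinT = T - T/4, MaxT = T + T/4, T = net_ticktime.

def detection_window(net_ticktime: float) -> tuple[float, float]:
    """Return (min_seconds, max_seconds) before a silent peer is declared down."""
    if net_ticktime <= 0:
        raise ValueError("net_ticktime must be positive")
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


if __name__ == "__main__":
    # 60 is the documented default; 120 is an illustrative higher setting.
    for ticktime in (60, 120):
        low, high = detection_window(ticktime)
        print(f"net_ticktime={ticktime}s: peer seen as down after {low:.0f} to {high:.0f}s")
```

So doubling the tick time gives a loaded node roughly twice as long to answer, but it also roughly doubles how long the rest of the cluster waits before reacting to a node that has genuinely died.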

Cheers,
Tim

