[rabbitmq-discuss] Two nodes in a cluster losing sight of each other?

Wed Oct 24 17:22:09 BST 2012

Hi Matt

On 24 Oct 2012, at 17:19, Matt Pietrek wrote:

> Thanks Tim. I'd come across the net_ticktime before but wasn't sure it was germane to this issue. We don't touch this value, so it would seem our VM would have to be busy for 60 seconds. Seems somewhat unlikely (it's dedicated to hosting MQ and nothing else), but not completely impossible. I'll continue to dig in this direction.
> 

Well that's fair enough, but if overload isn't the cause then either the network was being *very* slow or there was a real partition. That mnesia message only comes when distributed Erlang notices that the mnesia nodes have been separate in the past and then are later re-joined.
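
On the 60 seconds figure: as a rough sketch, and assuming the behaviour described in the Erlang kernel documentation (a connected node is ticked every net_ticktime/4 seconds and is treated as down somewhere between 0.75 and 1.25 times net_ticktime), the detection window can be worked out like this. The function name is made up for illustration; this is plain arithmetic, not anything Rabbit ships.

```python
def detection_window(net_ticktime=60):
    """Return (min_seconds, max_seconds) before a silent peer is treated as down.

    Assumption: ticks are sent every net_ticktime / 4 seconds, so detection
    lands within one tick interval either side of net_ticktime.
    """
    tick_interval = net_ticktime / 4
    return net_ticktime - tick_interval, net_ticktime + tick_interval


# Compare the default setting with a doubled one.
for ticktime in (60, 120):
    low, high = detection_window(ticktime)
    print(f"net_ticktime={ticktime}s: silent peer flagged after {low:.0f}-{high:.0f}s")
```

So with the default, a stall of roughly 45 seconds could already be enough, and doubling the setting pushes the worst case for noticing a genuinely dead node out to about 150 seconds.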

Cheers,
Tim

> Matt
> 
> On Wed, Oct 24, 2012 at 2:05 AM, Tim Watson <tim at rabbitmq.com> wrote:
> Hi Matt,
> 
> On 24 Oct 2012, at 02:22, Matt Pietrek wrote:
> 
> > I'm trying to track down a fun one. This is with 2.8.6. (We're in the process of moving these guys to 2.8.7, but want to understand what's happening first.)
> >
> > We have two nodes, mq1 and mq2. They simultaneously lose communication with each other, breaking the cluster, although they still continue to function independently. That is, each one thinks the other is down.
> >
> > Now the obvious explanation is some sort of network partition. However, in all of our extensive logs and after poring over all sorts of system data, I don't see any evidence of a network blip. Not saying it's not possible, just pretty unlikely. The only thing of note I can think of is that we were towards the end of an "apt-get update" when this happened.
> >
> 
> Whilst it looks like there's been a network partition here (indeed the mnesia running_partitioned_network message is pretty explicit about what it thinks has happened), there could be another explanation. If either node is heavily loaded, it is possible that the Erlang net kernel cannot get a response back from the other node quickly enough, causing the distribution subsystem to see the other node as unreachable (which is indistinguishable from 'down'). If this is what is happening, then you *could* tweak the net_ticktime and give it a higher setting, allowing the net kernel more time to potentially see a response from the other node. This is *not* a panacea, however, and can have other consequences, as all the rabbits in your cluster will take longer to notice if another node goes down. Use with caution!
> 
> Cheers,
> Tim
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss