[rabbitmq-discuss] Two nodes in a cluster losing sight of each other?

Matt Pietrek mpietrek at skytap.com
Wed Oct 24 17:19:39 BST 2012


Thanks Tim. I'd come across the net_ticktime before but wasn't sure it was
germane to this issue. We don't touch this value, so it would seem our VM
would have to be busy for 60 seconds. Seems somewhat unlikely (it's
dedicated to hosting MQ and nothing else), but not completely impossible.
I'll continue to dig in this direction.

Matt

On Wed, Oct 24, 2012 at 2:05 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hi Matt,
>
> On 24 Oct 2012, at 02:22, Matt Pietrek wrote:
>
> > I'm trying to track down a fun one. This is with 2.8.6. (We're in the
> process of moving these guys to 2.8.7, but want to understand what's
> happening first.)
> >
> > We have two nodes, mq1 and mq2. They simultaneously lose communication
> with each other, breaking the cluster, although they still continue to
> function independently. That is, each one thinks the other is down.
> >
> > Now the obvious explanation is some sort of network partition. However,
> in all of our extensive logs, and after poring over all sorts of system
> data, I don't see any evidence of a network blip. Not saying it's not
> possible, just pretty unlikely. The only thing of note I can think of is
> that we were towards the end of an "apt-get update" when this happened.
> >
>
> Whilst it looks like there's been a network partition here (indeed the
> mnesia running_partitioned_network message is pretty explicit about what it
> thinks has happened), there could be another explanation. If either node is
> heavily loaded, it is possible that the Erlang net kernel cannot get a
> response back from the other node quickly enough, causing the distribution
> subsystem to see the other node as unreachable (which is indistinguishable
> from 'down'). If this is what is happening, then you *could* raise the
> net_ticktime setting, allowing the net kernel more time to see a response
> from the other node. This is *not* a panacea, however, and has other
> consequences: all the rabbits in your cluster will take longer to notice
> when a node really does go down - use with caution!
>
> Cheers,
> Tim
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
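[Editor's note: for reference, a minimal sketch of raising net_ticktime as Tim describes, using the classic rabbitmq.config (Erlang terms) format; the file path and the value 120 are illustrative, not taken from the thread - the Erlang default is 60 seconds.]

```erlang
%% /etc/rabbitmq/rabbitmq.config (path may differ on your system)
%% net_ticktime belongs to the Erlang 'kernel' application, not 'rabbit'.
%% Raising it from the default 60s gives a loaded node longer to answer
%% tick probes before its peers declare it down; failure detection of a
%% genuinely dead node slows down correspondingly.
[
  {kernel, [{net_ticktime, 120}]}
].
```

The same setting must be applied on every node in the cluster; mixing different net_ticktime values across nodes can itself cause spurious disconnections.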

