[rabbitmq-discuss] Nodes losing contact with the cluster, using 2.5.1

Matthew Sackman matthew at rabbitmq.com
Wed Aug 31 10:50:02 BST 2011


Hi Otto,

On Wed, Aug 31, 2011 at 01:20:14AM -0700, Otto Bergström wrote:
> richmqcoll01:
> 
> richmqcoll02:
> 
> =ERROR REPORT==== 28-Aug-2011::15:22:48 ===
> ** Node rabbit@richmqcoll01 not responding **
> ** Removing (timedout) connection **
> 
> =INFO REPORT==== 28-Aug-2011::15:22:48 ===
> node rabbit@richmqcoll01 lost 'rabbit'
> 
> =INFO REPORT==== 28-Aug-2011::15:22:56 ===
> node rabbit@richmqcoll01 down
> 
> richmqcoll04:
> 
> =ERROR REPORT==== 28-Aug-2011::15:22:53 ===
> ** Node rabbit@richmqcoll01 not responding **
> ** Removing (timedout) connection **
> 
> =INFO REPORT==== 28-Aug-2011::15:22:53 ===
> node rabbit@richmqcoll01 lost 'rabbit'
> 
> =INFO REPORT==== 28-Aug-2011::15:23:00 ===
> node rabbit@richmqcoll01 down


Wow, that's very interesting: 02 and 04 thought they had lost 01, but
01 didn't think it had lost 02 or 04. What's even odder is your report
that this happens with 2.5.1 and not 2.4.X.

Having had a dig around, I've come across this -
http://erlang.org/pipermail/erlang-questions/2010-September/053336.html
- which I certainly wasn't aware of, and might potentially be causing
this. It's quite alarming, that one.

One thing you might wish to experiment with is increasing the
net_ticktime - this is roughly how long each node will wait without
hearing from another node before declaring that node dead. By default
it is 60 seconds, which I would have thought should be more than
sufficient, but I'm curious as to whether increasing it would solve the
problem for you.
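
To make the numbers a bit more concrete: as far as I understand the
Erlang kernel documentation, connected nodes exchange ticks every
net_ticktime/4 seconds, and a silent peer is declared down somewhere
between 0.75 and 1.25 times net_ticktime. The snippet below is only a
rough sketch of that arithmetic; tick_window is a made-up helper name,
not anything shipped with Erlang or RabbitMQ.

```shell
#!/bin/sh
# Sketch: print the tick interval and the failure-detection window for a
# given net_ticktime in seconds. tick_window is a hypothetical helper.
tick_window() {
    ticktime=$1
    quarter=$((ticktime / 4))
    echo "tick_every=${quarter}s detect_between=$((ticktime - quarter))s-$((ticktime + quarter))s"
}

tick_window 60    # the default value
tick_window 120   # the value suggested in this message
```

So doubling net_ticktime also doubles the worst-case time before a
genuinely dead node is noticed, which is the trade-off to keep in mind.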

Find your rabbitmq-server shell script, which should be at
/usr/sbin/rabbitmq-server, and make the following change *on all your
nodes*, then restart them:

The first non-comment line should look like:

SERVER_ERL_ARGS="+K true +A30 +P 1048576 \
-kernel inet_default_connect_options [{nodelay,true}]"

Change that to:

SERVER_ERL_ARGS="+K true +A30 +P 1048576 \
-kernel inet_default_connect_options [{nodelay,true}] \
-kernel net_ticktime 120"

(Quotes and trailing \ are important).
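
Since the change has to be made on every node, a quick check along the
following lines might help to confirm that none was missed. This is
only a sketch: check_ticktime is a hypothetical helper, and the path is
simply the one mentioned above, which may differ on your installation.

```shell
#!/bin/sh
# Sketch: report whether a start script carries the net_ticktime setting.
# check_ticktime is a hypothetical helper, not part of RabbitMQ.
check_ticktime() {
    # -e lets the pattern start with a dash; a missing or unreadable file
    # is reported as "missing" as well.
    if grep -q -e '-kernel net_ticktime' "$1" 2>/dev/null; then
        echo "net_ticktime set"
    else
        echo "net_ticktime missing"
    fi
}

# Path taken from this message; adjust it if your package installs elsewhere.
check_ticktime /usr/sbin/rabbitmq-server
```

This only inspects the script text. To see the value a running node
actually uses, net_kernel:get_net_ticktime() can be evaluated in an
Erlang shell attached to that node.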

Let me know how you get on - this isn't something we've seen and I'm
very curious as to what could be causing it.

Best wishes,

Matthew

