[rabbitmq-discuss] When rabbitmq is clustered with one other node we see a very slow dequeue of messages

GENTLING Gregory Gregory.GENTLING at us.thalesgroup.com
Thu Dec 5 19:00:11 GMT 2013


Classification: Thales Group Internal

Hi Simon,

The issue is not that RabbitMQ fails to detect a down node in a timely fashion; it does what I expect. The behavior in question is what happens after RabbitMQ removes the node due to net_ticktime expiration. If I set net_ticktime to 20 seconds, then 20 seconds go by, Node B is removed, and the slow message delivery begins. Likewise, if I set it to 10 minutes, then after 10 minutes Node B is removed and the slowness begins. Five to ten minutes after Node B is removed, the server catches up. So we are seeing degraded performance for up to 10 minutes *after* Node B is removed from the cluster, so much so that even with a light load of 1 msg/sec the consumer falls behind by over 100 messages after about 5 minutes. net_ticktime only affects when the server becomes degraded, not for how long.
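
To put a rough number on the light-load case: at 1 msg/sec, 5 minutes of publishing is 300 messages, so a backlog of more than 100 means the consumer received at most about 200 of them. A back-of-envelope sketch of that arithmetic follows. The function name is made up for illustration, and it assumes a constant publish rate and an empty queue at the start; the figures are the ones quoted above.

```python
def effective_delivery_rate(publish_rate, elapsed_seconds, backlog):
    """Average delivery rate implied by a backlog after a given time.

    Assumes a constant publish rate and no backlog at time zero.
    """
    published = publish_rate * elapsed_seconds
    delivered = published - backlog
    return delivered / elapsed_seconds


# 1 msg/sec published, consumer 100 messages behind after 5 minutes
rate = effective_delivery_rate(publish_rate=1.0, elapsed_seconds=5 * 60, backlog=100)
print(f"{rate:.2f} msg/sec")  # prints "0.67 msg/sec"
```

In other words, during the degraded window the consumer is averaging no better than roughly two thirds of the publish rate.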

Thanks,

Greg & James

-----Original Message-----
From: Simon MacMullen [mailto:simon at rabbitmq.com]
Sent: Wednesday, December 04, 2013 3:41 AM
To: Discussions about RabbitMQ
Cc: GENTLING Gregory
Subject: Re: [rabbitmq-discuss] When rabbitmq is clustered with one other node we see a very slow dequeue of messages

The issue is that you are getting into a situation where node B is down, but node A is not aware of this (probably because, at the TCP level, it cannot tell that the connection has been closed). Node A therefore has to wait a considerable time (the net_ticktime) trying to send packets to node B before giving up and treating the node as down.

If node A can tell at the TCP level that the connection to node B has gone down, then you won't have this wait; it'll just mark the node as down immediately and carry on.

To some extent you can tweak this behaviour by reducing net_ticktime - but a short net_ticktime makes it plausible that a node will be considered down when it isn't.
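
As a rough guide to that trade-off, the Erlang kernel documentation describes the time to detect an unresponsive node as falling between net_ticktime minus a quarter and net_ticktime plus a quarter, with a default net_ticktime of 60 seconds. A small sketch of that window (the helper name is hypothetical; the 20-second and 10-minute values are the ones tried in this thread):

```python
def detection_window(net_ticktime):
    """Return (min, max) seconds before an unresponsive node is treated as down.

    Based on the documented bounds: TickTime - TickTime/4 and TickTime + TickTime/4.
    """
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


for ticktime in (20, 60, 600):
    low, high = detection_window(ticktime)
    print(f"net_ticktime={ticktime}s -> node treated as down after {low:g}s to {high:g}s")
```

With a 20-second setting, a peer that stays silent for as little as 15 seconds can be treated as down, which is why very short values risk false positives.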

See http://www.rabbitmq.com/partitions.html for more.

Cheers, Simon

On 04/12/13 01:52, GENTLING Gregory wrote:
> Classification: Open
>
> When RabbitMQ is clustered with one other node we see a very slow
> dequeue of messages. The scenario is simple: Node A and Node B are in
> a cluster, configured with the auto_heal option and the default
> net_ticktime. Steps to repeat are:
>
> (These are all local connections)
>
> (1) Connect client A1 to Node A
>
> a. Client A1 creates a topic exchange
>
> b. Client A1 is a publisher with 1 msg/sec
>
> (2) Connect client A2 to Node A
>
> a. Client A2 listens for the messages in the exchange
>
> (3) Connect client B to Node B (this is important, the issue does not
> occur unless you have this remote client)
>
> a. Client B listens for the messages in the exchange
>
> (4) Pull the plug on Node B (you will not see the issue with a graceful
> shutdown); alternatively, you can use "route" to make Node B
> unroutable from Node A
>
> a. If you kill rabbitmq, you will not see the issue
>
> (5) Wait for net_ticktime (or until you see Node B being removed from the
> cluster in Node A's log)
>
> (6) Client A2 no longer receives messages at 1 msg/sec; it falls
> considerably behind but recovers in about 10 minutes.
>
> We have two setups with slightly different network setups (two pairs
> of Node A and B). We see this issue on one but not the other, so
> it cannot always be reproduced.
>
> Other issues observed in this state:
>
> * rabbitmqctl cluster_status/list_queues/list_connections/list_exchanges
> all hang; rabbitmq status does not hang
>
> * declareQueue, declareExchange, declareExchangePassive all hang
>
> * disabling auto_heal does not help
>
> * tested with both Erlang 5.9 and 5.10.3
>
> * tested with both RabbitMQ 3.1.5 and 3.1.3, same issue in both
>
> * don't see this issue with direct exchange
>
> * nothing in vmstat out of the ordinary, CPU is not pegged, system is
> not thrashing
>
> Things we have ruled out:
>
> * iptables, tested with no rules
>
> * SELinux, tested in permissive mode
>
> * Java drivers
>
> Same thing as described here:
>
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-June/027674.html
>
> Thank you,
>
> *Greg Gentling*
>
> Principal Software Architecture - Avant CommonApps
>
> Thales Avionics, Inc.
>
> In-Flight Entertainment and Connectivity
>
> Irvine, CA 92618
>
> 949-595-4943
>


--
Simon MacMullen
RabbitMQ, Pivotal


