[rabbitmq-discuss] When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Wed Dec 4 11:41:13 GMT 2013

The issue is that node B is down but node A is not aware of this
(probably because, at the TCP level, it has not seen the connection
close). Node A therefore has to wait a considerable time (the
net_ticktime, 60 seconds by default) trying to send packets to node B
before giving up and treating the node as down.

If node A can tell at the TCP level that the connection to node B has 
gone down, then you won't have this wait; it will just mark the node as 
down immediately and carry on.

To some extent you can tweak this behaviour by reducing net_ticktime, 
but a short net_ticktime makes it more likely that a node will be 
considered down when it isn't.
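
As a minimal sketch of that tweak, assuming the classic Erlang-terms
rabbitmq.config file: net_ticktime belongs to the Erlang kernel
application and is given in seconds. The value 30 below is purely
illustrative, not a recommendation, and it should be the same on every
node in the cluster.

```erlang
%% rabbitmq.config (illustrative value only, not a recommendation)
[
  {kernel, [
    %% Seconds; the default is 60. Lower values detect a dead peer
    %% sooner but raise the risk of false "node down" detections.
    {net_ticktime, 30}
  ]}
].
```

The node has to be restarted to pick up a change made in the config file.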

See http://www.rabbitmq.com/partitions.html for more.

Cheers, Simon

On 04/12/13 01:52, GENTLING Gregory wrote:
> When RabbitMQ is clustered with one other node, we see very slow
> dequeuing of messages. The scenario is simple: Node A and Node B form
> the cluster. They are clustered with the auto_heal option and the
> default net_ticktime. Steps to reproduce:
>
> (These are all local connections)
>
> (1) Connect client A1 to Node A
>     a. Client A1 creates a topic exchange
>     b. Client A1 publishes at 1 msg/sec
>
> (2) Connect client A2 to Node A
>     a. Client A2 listens for the messages in the exchange
>
> (3) Connect client B to Node B (this is important; the issue does not
>     occur unless you have this remote client)
>     a. Client B listens for the messages in the exchange
>
> (4) Pull the plug on Node B (you will not see the issue with a graceful
>     shutdown); alternatively, use “route” to make Node B unroutable
>     from Node A
>     a. If you kill rabbitmq, you will not see the issue
>
> (5) Wait for net_ticktime (or until you see Node B being removed from
>     the cluster in Node A’s log)
>
> (6) Client A2 no longer receives messages at 1 msg/sec; it falls
>     considerably behind but recovers in about 10 minutes.
>
> We have two environments with slightly different network
> configurations (each a pair of Node A and Node B). We see the issue on
> one but not the other, so it cannot always be reproduced.
>
> Other issues observed in this state:
>
> · rabbitmqctl cluster_status/list_queues/list_connections/list_exchanges
>   all hang; rabbitmqctl status does not hang
> · declareQueue, declareExchange, declareExchangePassive all hang
> · disabling auto_heal does not help
> · tested with both Erlang 5.9 and 5.10.3
> · tested with both RabbitMQ 3.1.5 and 3.1.3; same issue in both
> · we don’t see this issue with a direct exchange
> · nothing out of the ordinary in vmstat: CPU is not pegged, system is
>   not thrashing
>
> Things we have ruled out:
>
> · iptables (tested with no rules)
> · SELinux (tested in permissive mode)
> · Java drivers
>
> Same thing as described here:
>
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-June/027674.html
>
> Thank you,
>
> *Greg Gentling*
>
> Principal Software Architecture - Avant CommonApps
>
> Thales Avionics, Inc.
>
> In-Flight Entertainment and Connectivity
>
> Irvine, CA 92618
>
> 949-595-4943
>
> This email was classified by GENTLING Gregory on Tuesday, December 03,
> 2013 5:52:25 PM.

-- 
Simon MacMullen
RabbitMQ, Pivotal