[rabbitmq-discuss] When rabbitmq is clustered with one other node we see a very slow dequeue of messages

Wed Dec 4 18:39:15 GMT 2013

Simon MacMullen <simon at ...> writes:
> 
> The issue is that you are getting into a situation where node B is down, 
> but node A is not aware of this (probably because from a TCP level it's 
> not aware that the connection has been closed). Node A therefore has to 
> wait a considerable time (the net_ticktime) trying to send packets to 
> node B before giving up and treating the node as down.
> 
> If node A can tell at the TCP level that the connection to node B has 
> gone down, then you won't have this wait, it'll just mark the node as 
> down immediately and carry on.
> 
> To some extent you can tweak this behaviour by reducing net_ticktime - 
> but a short net_ticktime makes it plausible that a node will be 
> considered down when it isn't.
> 
> See http://www.rabbitmq.com/partitions.html for more.
> 
> Cheers, Simon

Hi Simon,

The issue is not that RabbitMQ fails to detect a down node in a timely 
fashion; it does what I expect. The behavior in question is what happens 
after RabbitMQ removes the node when net_ticktime expires. If I set 
net_ticktime to 20 seconds, 20 seconds go by, Node B is removed, and then 
the slow message delivery begins. Likewise, if I set it to 10 minutes, Node 
B is removed after 10 minutes and then the slowness begins. Five to ten 
minutes after Node B is removed, the server catches up. So we are seeing 
degraded performance for up to 10 minutes *after* Node B is removed from 
the cluster. 
The degradation is severe enough that even with a light load of 1 msg/sec, 
the consumer falls behind by over 100 messages after about 5 minutes. 
net_ticktime only affects when the server becomes degraded, not for how 
long.
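
To put rough numbers on that, here is a small Python sketch of the 
arithmetic. It assumes a steady publish rate of exactly 1 msg/sec, a window 
of exactly 300 seconds, and a backlog of exactly 100 messages (the real 
figures were approximate), and the helper name is purely illustrative:

```python
def delivery_estimate(publish_rate, elapsed_s, backlog):
    """Estimate what the consumer actually received during a degraded window.

    publish_rate: messages published per second (assumed constant)
    elapsed_s:    length of the window in seconds
    backlog:      messages still waiting in the queue at the end of the window
    """
    published = publish_rate * elapsed_s      # total messages sent
    consumed = published - backlog            # messages that were delivered
    effective_rate = consumed / elapsed_s     # average delivery rate
    return published, consumed, effective_rate


# Assumed round figures: 1 msg/sec for 5 minutes with 100 messages queued.
published, consumed, effective_rate = delivery_estimate(1.0, 5 * 60, 100)
print(f"published={published:.0f} consumed={consumed:.0f} "
      f"rate={effective_rate:.2f} msg/sec")
```

Since the backlog was over 100 messages, the delivery rate this gives 
(about two thirds of the publish rate) is an upper bound for that window.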

Thanks,
James