[rabbitmq-discuss] routing threads on a rabbitmq node
matthew at lshift.net
Mon Feb 1 14:42:15 GMT 2010
On Sun, Jan 31, 2010 at 10:41:01PM -0800, Brian Sullivan wrote:
> I am curious if anyone on the rabbitmq team can confirm/clarify what we are
> seeing with respect to some throughput issues on our RMQ cluster.
> The config:
> - 2-node RMQ cluster, running a topic-based exchange
> - 8 publishers, running on different hosts
> - dozens of consumers, ~75 wildcard topic bindings, mostly running on
> different hosts (there are a couple running on the RMQ hosts for stats, etc)
> The issue:
> When we publish at a higher rate than normal, there appears to be a
> significant delay in the pipeline between when we publish the messages and
> when we receive them on the consuming side.
Although what you later say about memory growth suggests it's not this, it
could be some sort of buffering, or Nagle's algorithm, causing messages to
be batched up until some buffer is full before they're passed on to the
network. On the other hand, if the heavy memory growth is in the
rabbitmq-server process itself, then that suggests it's nothing to do with
buffering in the clients.
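If client-side batching does turn out to be involved, disabling Nagle on the
client's socket is the usual first experiment. A minimal sketch in plain
Python (whether, and how, your AMQP client library exposes this option is an
assumption you'd need to check):

    import socket

    # Connect to the broker's AMQP port (the host name is a placeholder).
    sock = socket.create_connection(("rabbit-host", 5672))

    # Disable Nagle's algorithm: small frames go out immediately rather
    # than being coalesced until a full segment has accumulated.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)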
> Since publishing is
> asynchronous, the publisher applications send as fast as they can, meanwhile
> we see an increasing delay in when we see those same messages come out on
> the other side. My guess (gathered from
> http://www.rabbitmq.com/faq.html#node-per-CPU-core) is that there is either
> a single routing thread per publisher (channel) or, even worse, a single
> routing bottleneck per node. Either way, this thread cannot route fast
> enough in a topic exchange (we have about 75 bindings, using wildcards) and
> there is a backup of messages to be routed.
Each channel can only route one message at a time. Topic exchanges with
wildcards are inefficient: routing is O(N), where N is the number of
bindings, because each binding pattern has to be tested in turn. This is
suboptimal; there are ways in which we're planning to fix it, we've just
not got around to implementing them yet.
However, if you really just have approx 75 bindings with wildcards in
total, I'm somewhat astonished this can be causing issues. What kind of
rates are you publishing at?
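To make the O(N) point concrete, here's a sketch of what per-message
wildcard routing amounts to, in plain Python rather than the broker's
actual Erlang (so purely illustrative): every binding pattern gets tested
against every published routing key.

    def topic_matches(pattern, key):
        # AMQP topic semantics: '*' matches exactly one dot-separated
        # word, '#' matches zero or more words.
        pat, words = pattern.split("."), key.split(".")

        def match(pi, ki):
            if pi == len(pat):
                return ki == len(words)
            if pat[pi] == "#":
                # '#' may absorb zero or more of the remaining words.
                return any(match(pi + 1, kj)
                           for kj in range(ki, len(words) + 1))
            if ki == len(words):
                return False
            return pat[pi] in ("*", words[ki]) and match(pi + 1, ki + 1)

        return match(0, 0)

    def route(bindings, routing_key):
        # The linear scan: with N bindings, every publish pays for N
        # pattern matches, so routing cost grows with the binding count.
        return [q for pattern, q in bindings
                if topic_matches(pattern, routing_key)]

    bindings = [("stats.#", "stats-q"), ("*.error", "errors-q")]
    print(route(bindings, "stats.cpu"))   # ['stats-q']
    print(route(bindings, "disk.error"))  # ['errors-q']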
> The question:
> Can you please elaborate on where the routing backup could be occurring, and
> what steps might be best to prevent this from happening? It appears from
> the fact that I am waiting on the routing to happen that using flags like
> "mandatory" on messages is not going to help me here (though I have not
> tested this).
I suspect the backlog is building up in the channel processes. You're right
that the "mandatory" flag won't help here: it only controls what happens to
messages which turn out to be unroutable, it doesn't make routing any
faster. I'm not really sure yet what to suggest, but could you provide some
more information please:
1) How many msgs/second are being published for this issue to occur?
2) How big are those messages?
3) Can you give an example of the routing key used?
4) How many queues do messages end up in, on average?
5) Are the consumers setting qos (basic.qos prefetch), and are they using
subscriptions (basic.consume) or just basic.get? What about
acknowledgements? (See the sketch below for what I mean.)
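To be clear about what I'm asking in (5), here's a minimal consumer that
sets a prefetch limit and acks manually. I've written it with the Python
pika client, which is an assumption - the thread doesn't say which client
you're using.

    import pika

    # Broker host is a placeholder; point it at one of your cluster nodes.
    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbit-host"))
    ch = conn.channel()

    # basic.qos: the broker won't push more than 50 unacked messages
    # to this consumer at once.
    ch.basic_qos(prefetch_count=50)

    def on_message(channel, method, properties, body):
        print("got:", body)
        # Explicit acknowledgement; until this is sent the broker keeps
        # the message as unacked, pending redelivery.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    # basic.consume (a subscription), as opposed to polling with basic.get.
    ch.basic_consume(queue="my-queue", on_message_callback=on_message,
                     auto_ack=False)
    ch.start_consuming()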
> One idea:
> If it is truly the case that a single thread per node might be causing this
> problem, then perhaps we can run a small rabbitmq node on each publisher
> (joined to the cluster), with the sole purpose of doing the routing load?
> If we publish locally, all it would need to do is keep up with its own
> routing load, not the combined routing load of 3 other publishers. It
> doesn't really prevent the problem from happening though, if I can produce
> messages faster in a single thread than even a dedicated node can route.
> Would this even help?
Yeah, it may help, but without some more details, I'm not quite sure
just yet what to suggest.
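If you do experiment with that layout, the only client-side change is that
each publisher connects to the RabbitMQ node on its own host, so the
routing work for its messages lands on the local cluster member. A sketch,
again assuming the Python pika client and a made-up exchange name:

    import pika

    # Publish via the cluster node running on this same host, so routing
    # for these messages is done locally rather than on a remote node.
    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost"))
    ch = conn.channel()
    ch.basic_publish(exchange="my-topic-exchange",   # hypothetical name
                     routing_key="stats.cpu",
                     body=b"payload")
    conn.close()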