[rabbitmq-discuss] routing threads on a rabbitmq node

Brian Sullivan bsullivan at lindenlab.com
Mon Feb 1 06:41:01 GMT 2010


I am curious if anyone on the rabbitmq team can confirm/clarify what we are
seeing with respect to some throughput issues on our RMQ cluster.

The config:
- 2-node RMQ cluster, running a topic-based exchange
- 8 publishers, running on different hosts
- dozens of consumers, ~75 wildcard topic bindings, mostly running on
different hosts (there are a couple running on the RMQ hosts for stats, etc.)

The issue:
When we publish at a higher rate than normal, there appears to be a
significant delay between when we publish messages and when we receive
them on the consuming side.  Since publishing is asynchronous, the publisher
applications send as fast as they can, and the delay before those same
messages come out on the other side keeps increasing.  My guess (gathered
from http://www.rabbitmq.com/faq.html#node-per-CPU-core) is that there is
either a single routing thread per publisher (channel) or, even worse, a
single routing bottleneck per node.  Either way, this thread cannot route
fast enough in a topic exchange (we have about 75 bindings, using wildcards),
so messages back up waiting to be routed.
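
To illustrate the kind of per-message work a topic exchange has to do, here
is a minimal sketch of wildcard matching, assuming a naive router that tests
every binding in turn.  The function names and the sample bindings are
hypothetical, and this is not a claim about how RabbitMQ implements routing
internally.  It only follows the AMQP topic rules: "*" matches exactly one
dot-separated word and "#" matches zero or more words.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        # '*' matches any single word; otherwise the words must be equal.
        return (p[i] == "*" or p[i] == k[j]) and match(i + 1, j + 1)

    return match(0, 0)


def route(bindings, routing_key):
    """Naive routing: test the key against every (pattern, queue) binding."""
    return [queue for pattern, queue in bindings
            if topic_matches(pattern, routing_key)]


if __name__ == "__main__":
    # Hypothetical bindings, for illustration only.
    bindings = [("stats.#", "q1"), ("*.host1.*", "q2"), ("logs.#", "q3")]
    print(route(bindings, "stats.host1.cpu"))
```

With a scan like this, every published message costs one match attempt per
binding, so with ~75 bindings the routing work per message grows with the
number of bindings as well as with the publish rate.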

This is dangerous and hard to control, since we have seen memory grow in
such a way that we have a hard time stopping it.  In certain failure modes,
when the publisher disconnects, all of the messages pending to be routed are
discarded - possibly hours of data in our environment.  It also limits the
scaling we've been able to do, since if we fire up all 8 publishers at high
volume, this backup problem spreads across all streams.
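
A rough way to see why the delay and memory keep growing: a minimal sketch,
assuming a single routing stage that drains at a fixed rate while publishers
send at a higher fixed rate.  The rates below are made-up numbers for
illustration, not measurements from the cluster described here.

```python
def backlog_after(publish_rate: float, route_rate: float, seconds: float) -> float:
    """Messages waiting to be routed after `seconds` of sustained load."""
    # The backlog grows by (publish_rate - route_rate) each second and
    # never drops below zero.
    return max(0.0, (publish_rate - route_rate) * seconds)


def delay_for_new_message(publish_rate: float, route_rate: float, seconds: float) -> float:
    """Seconds a message published at time `seconds` waits behind the backlog."""
    return backlog_after(publish_rate, route_rate, seconds) / route_rate


if __name__ == "__main__":
    # Hypothetical rates: 10,000 msg/s published, 8,000 msg/s routed.
    print(backlog_after(10_000, 8_000, 60))
    print(delay_for_new_message(10_000, 8_000, 60))
```

Under this model both the backlog and the delay grow linearly for as long as
the publish rate stays above the routing rate, which is consistent with the
steadily increasing delay and memory growth described above.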

The question:
Can you please elaborate on where the routing backup could be occurring, and
what steps might be best to prevent this from happening?  Since the messages
are still waiting to be routed, it appears that using flags like "mandatory"
on messages is not going to help me here (though I have not tested this).

One idea:
If it is truly the case that a single thread per node might be causing this
problem, then perhaps we can run a small rabbitmq node on each publisher
(joined to the cluster), with the sole purpose of doing the routing load?
If we publish locally, all it would need to do is keep up with its own
routing load, not the combined routing load of 3 other publishers.  It
doesn't really prevent the problem from happening though, if I can produce
messages faster in a single thread than even a dedicated node can route.
Would this even help?
