Hi,<br><br>I am curious whether anyone on the RabbitMQ team can confirm/clarify what we are seeing with respect to some throughput issues on our RMQ cluster.<br><br>The config:<br>- 2-node RMQ cluster, running a topic-based exchange<br>
- 8 publishers, running on different hosts<br>- dozens of consumers, ~75 wildcard topic bindings, mostly running on different hosts (there are a couple running on the RMQ hosts for stats, etc.)<br><br>The issue:<br>When we publish at a higher rate than normal, there appears to be a significant delay in the pipeline between when we publish messages and when we receive them on the consuming side. Since publishing is asynchronous, the publisher applications send as fast as they can, and meanwhile the delay before those same messages come out on the other side keeps growing. My guess (based on <a href="http://www.rabbitmq.com/faq.html#node-per-CPU-core" target="_blank">http://www.rabbitmq.com/faq.html#node-per-CPU-core</a>) is that there is either a single routing thread per publisher (channel), or worse, a single routing bottleneck per node. Either way, that thread cannot route fast enough through a topic exchange with our ~75 wildcard bindings, and messages back up waiting to be routed.<br>
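To make the routing cost concrete, here is a rough sketch of the per-message work a topic exchange has to do: check the routing key against every wildcard binding. This is just an illustration in Python of AMQP topic-matching semantics ('*' matches exactly one dot-separated word, '#' matches zero or more), not RabbitMQ's actual implementation, and the binding patterns below are made up:

```python
def topic_matches(pattern, routing_key):
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more words."""
    def match(p, k):
        if not p:
            return not k                 # pattern exhausted: key must be too
        if p[0] == "#":
            # '#' can swallow zero words (skip it) or one word (consume from key)
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False                 # pattern has words left but key is empty
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

# Stand-in for our ~75 wildcard bindings (hypothetical names):
bindings = ["stock.*.nyse", "stock.#", "metrics.*.cpu"] + \
           [f"app{i}.#.errors" for i in range(72)]
key = "stock.ibm.nyse"
matched = [b for b in bindings if topic_matches(b, key)]
# Every publish pays this matching work over the bindings, so per-message
# routing cost grows with the number of wildcard bindings.
```

Even if the broker is smarter than a linear scan internally, the point is that routing is CPU work done per message on the broker, so a single routing thread can become the bottleneck.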
<br>This is dangerous and hard to control, since we have seen memory grow to the point where we have a hard time stopping it. In certain failure modes, when the publisher disconnects, all of the messages pending to be routed are discarded - possibly hours of data in our environment. It also limits the scaling we've been able to do, since if we fire up all 8 publishers at high volume, this backup problem spreads across all streams.<br>
<br>The question:<br>Can you please elaborate on where the routing backup could be occurring, and what steps would best prevent it? Since the delay I am seeing happens while messages are waiting to be routed, it appears that flags like "mandatory" on messages are not going to help me here (though I have not tested this).<br>
<br>One idea:<br>If it is truly the case that a single thread per node is causing this problem, then perhaps we could run a small RabbitMQ node on each publisher host (joined to the cluster), with the sole purpose of carrying that publisher's routing load? If we publish locally, each such node would only need to keep up with its own routing load, not the combined routing load of the other publishers. It doesn't really prevent the problem, though, if a single publisher thread can produce messages faster than even a dedicated node can route them. Would this even help?<br>
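For what it's worth, on recent RabbitMQ releases (3.x) joining an extra node to an existing cluster looks roughly like the commands below. The node name rabbit@central1 is hypothetical, and I have not verified that a routing-only node like this actually relieves the bottleneck:

```shell
# On each publisher host, start a local node and join it to the cluster.
rabbitmq-server -detached
rabbitmqctl stop_app
# --ram keeps this node's metadata in memory only; it is not meant to own queues.
rabbitmqctl join_cluster --ram rabbit@central1
rabbitmqctl start_app
```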
<br>Thanks,<br>Brian<br><br>