Hi Matthew,<br><br>Thanks for the speedy response. Answers below. And yes, this is all memory growth on the RabbitMQ side: the clients burn through their catchup logs pretty quickly and then coast along at regular rates, while the server chews on its backlog, sometimes for *hours*.<br>
<br>1) How many msgs/second are being published for this issue to occur?<br>From a single producer, about 900 messages/sec during these catchup bursts. Normal volume then drops to 300-500 messages/sec throughout the day, which we can keep up with for the most part. Note that there are 8-9 such producers, distributed across 2 nodes.<br>
<br>
2) How big are those messages?<br>They vary in size, but in the neighborhood of 500 bytes each. Pretty small.<br><br>
3) Can you give an example of the routing key used?<br>We were originally looking to do <domain>.<eventname>, but really everything subscribes to "#.<eventname>". That's a little wasteful, and at some point I'd like to switch back to a direct exchange and partition our traffic by domain some other way, since we don't subscribe to things across all domains like I expected we might. There are on the order of 100 eventnames at this point, of varying frequencies.<br>
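<br>For concreteness, here is roughly what that exchange/binding pattern looks like with the Java client - a minimal sketch from memory, not our actual code. The exchange, queue, domain, and eventname below are invented for illustration, "channel" is assumed to be an already-open Channel, and the method signatures are the stock Channel API, which may differ slightly between client versions:<br><br>
import com.rabbitmq.client.Channel;<br><br>
// Publisher side: topic exchange, routing keys of the form <domain>.<eventname><br>
// payload is the ~500-byte message body (byte[])<br>
channel.exchangeDeclare("events", "topic", true);<br>
channel.basicPublish("events", "billing.order_created", null, payload);<br><br>
// Consumer side: one wildcard binding per eventname, ignoring the domain part<br>
channel.queueDeclare("order_created_q", true, false, false, null);<br>
channel.queueBind("order_created_q", "events", "#.order_created");<br><br>
Since nothing actually keys off the domain prefix, each of those ~75 bindings ends up being a pure wildcard match, which is part of why I'm tempted to go back to a direct exchange.<br>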
<br>
4) How many queues do messages end up in, on average?<br>About the same as the number of bindings - 75. We rarely (if ever) put multiple bindings on a single queue.<br><br>
5) Are the consumers setting qos, and are they using subscriptions or just basic.get? What about acknowledgements?<br>Consumers use the Java client with no qos settings, consuming via subscriptions (QueueingConsumer.nextDelivery()).<br>
<br>Acknowledgements are sent after each message is retrieved, via QueueingConsumer.getChannel().basicAck(envelope.getDeliveryTag(), false);<br><br>Like I said, there is no queue backup, so I don't think the problem is on the consuming side. In fact, I can bring up a new client that just does a simple subscription and it will instantly start showing wherever the routing currently is, which could be messages from the past hour.<br>
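<br>In case it helps, the consume loop is essentially the textbook QueueingConsumer pattern. A rough sketch from memory is below (the queue name and process() helper are made up for illustration, "channel" is an already-open Channel, and exception handling is omitted):<br><br>
import com.rabbitmq.client.QueueingConsumer;<br><br>
QueueingConsumer consumer = new QueueingConsumer(channel);<br>
channel.basicConsume("order_created_q", false, consumer); // autoAck = false; we never call basicQos<br>
while (true) {<br>
    QueueingConsumer.Delivery delivery = consumer.nextDelivery(); // blocks until a message arrives<br>
    process(delivery.getBody()); // application-specific handling<br>
    consumer.getChannel().basicAck(delivery.getEnvelope().getDeliveryTag(), false); // ack one message at a time<br>
}<br>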
<br>Does that help? Thanks again for digging into this with me; this has been a growing problem for us, and I need to understand it better so I can rearchitect our configuration.<br><br>Thanks,<br>Brian<br><br><br><div class="gmail_quote">
On Mon, Feb 1, 2010 at 6:42 AM, Matthew Sackman <span dir="ltr"><<a href="mailto:matthew@lshift.net">matthew@lshift.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi Brian,<br>
<div class="im"><br>
On Sun, Jan 31, 2010 at 10:41:01PM -0800, Brian Sullivan wrote:<br>
> I am curious if anyone on the rabbitmq team can confirm/clarify what we are<br>
> seeing with respect to some throughput issues on our RMQ cluster.<br>
><br>
> The config:<br>
> - 2-node RMQ cluster, running a topic-based exchange<br>
> - 8 publishers, running on different hosts<br>
> - dozens of consumers, ~75 wildcard topic bindings, mostly running on<br>
> different hosts (there are a couple running on the RMQ hosts for stats, etc)<br>
><br>
> The issue:<br>
> When we publish at a higher rate than normal, there appears to be a<br>
> significant delay in the pipeline between when we publish the messages and<br>
> when we receive them on the consuming side.<br>
<br>
</div>Although what you later say about memory growth suggests it's not this,<br>
it could be some sort of buffering or Nagle's algorithm batching up<br>
messages until some buffer is full before passing them on to the<br>
network. On the other hand, if you're seeing heavy memory growth in the<br>
RabbitMQ server itself, then that suggests it's nowt to do with<br>
buffering in the clients.<br>
<div class="im"><br>
> Since publishing is<br>
> asynchronous, the publisher applications send as fast as they can, meanwhile<br>
> we see an increasing delay in when we see those same messages come out on<br>
> the other side. My guess (gathered from<br>
> <a href="http://www.rabbitmq.com/faq.html#node-per-CPU-core" target="_blank">http://www.rabbitmq.com/faq.html#node-per-CPU-core</a>) is that there is either<br>
> a single routing thread per publisher (channel), or even worse a single<br>
> routing bottleneck per node. Either way, this thread cannot route fast<br>
> enough in a topic exchange (we have about 75 bindings, using wildcards) and<br>
> there is a backup of messages to be routed.<br>
<br>
</div>Each channel can only route one message at a time. Topic exchanges with<br>
wildcards are inefficient: routing is O(N), where N is the number of<br>
bindings. This is suboptimal - there are ways in which we are planning<br>
to fix it; we've just not got around to implementing them yet.<br>
However, if you really have only approx 75 bindings with wildcards in<br>
total, I'm somewhat astonished this can be causing issues. What kind of<br>
rates are you publishing at?<br>
<div class="im"><br>
> The question:<br>
> Can you please elaborate on where the routing backup could be occurring, and<br>
> what steps might be best to prevent this from happening? It appears from<br>
> the fact that I am waiting on the routing to happen that using flags like<br>
> "mandatory" on messages is not going to help me here (though I have not<br>
> tested this).<br>
<br>
</div>I suspect it's in the channel processes. I'm not really sure yet what<br>
you could do about it, but could you provide some more information, please?<br>
<br>
1) How many msgs/second are being published for this issue to occur?<br>
2) How big are those messages?<br>
3) Can you give an example of the routing key used?<br>
4) How many queues do messages end up in, on average?<br>
5) Are the consumers setting qos, and are they using subscriptions or<br>
just basic.get? What about acknowledgements?<br>
<div class="im"><br>
> One idea:<br>
> If it is truly the case that a single thread per node might be causing this<br>
> problem, then perhaps we can run a small rabbitmq node on each publisher<br>
> (joined to the cluster), with the sole purpose of doing the routing load?<br>
> If we publish locally, all it would need to do is keep up with its own<br>
> routing load, not the combined routing load of 3 other publishers. It<br>
> doesn't really prevent the problem from happening though, if I can produce<br>
> messages faster in a single thread than even a dedicated node can route.<br>
> Would this even help?<br>
<br>
</div>Yeah, it may help, but without some more details, I'm not quite sure<br>
just yet what to suggest.<br>
<div><div></div><div class="h5"><br>
Matthew<br>
<br>
_______________________________________________<br>
rabbitmq-discuss mailing list<br>
<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a><br>
<a href="http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><br>
</div></div></blockquote></div><br>