<br>Hi Matthias,<br><br>It&#39;s good to know that those numbers look high to you.  Knowing where our bounds are will help me readjust our configuration.<br><br>We have ~75 bindings, same as the number of queues.   We don&#39;t do many multiple bindings per queue (if any).  The number of bindings has increased faster than our message volumes (more consuming applications making use of the data), so I believe this is the primary reason things are harder now than they used to be.<br>


<br>Moving to a direct exchange is in the works, but unfortunately it is not a quick change for us at this point.<br><br>What I would like to figure out is how to reorient my cluster to make things more stable.  Knowing that the routing time is increasing due to the number of bindings, I am not convinced that my plan of adding a rabbitmq node to each producer is going to make things all that much better - the routing table will still be the same, and it will still need to do the cross-node routing you suggest avoiding.  Even when we have a single producer catching up in our current system, the node can only route at a certain rate, and this is definitely not CPU bound.  I am curious why Erlang cannot spend more time in that thread, but I don&#39;t know much about it - does that seem right to you?<br>


<br>I am not sure what I can do to minimize cross-routing, other than to try to keep our producers consolidated and keep the heaviest consumers (meaning the ones with a binding to the most active topics - remember that all queues bind to only one topic expression) separated on their own nodes, to take their queue management processing off the nodes doing the core routing.  Ironically, I was originally trying to keep the heaviest consumers on the routing nodes, to minimize forwarding of messages - but if the cost grows with the number of consumer queues, then keeping the consumers with the larger fanout (but smaller throughput) on the routing nodes might be best.<br>


<br>The thing that concerns me is that my scalability here seems to be limited - the only other thing I can think of doing is increasing my number of producers to distribute the load even further and possibly do the local node thing - then if our routing table keeps growing, I can manage scaling at the producer level - not efficient maybe, but at least it can grow past the threshold I appear to be running into.<br>


<br>Thanks for the background.  I would love to see more documentation on how the process model works.  Let me know if the above triggers any other solutions.<br><br>Thanks,<br>Brian<br><br><br><div class="gmail_quote">On Tue, Feb 2, 2010 at 6:00 PM, Matthias Radestock <span dir="ltr">&lt;<a href="mailto:matthias@lshift.net">matthias@lshift.net</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Brian,<div class="im"><br>

<br>

Brian Sullivan wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

1) How many msgs/second are being published for this issue to occur?<br>

 From a single producer, about 900 messages/sec during these burst catchup periods.  Normal volumes then drop down to 300-500 mps throughout the day, which we can keep up with for the most part.  Note that there are 8-9 such producers, distributed across 2 nodes.<br>


</blockquote>

<br></div>

So that&#39;s a 900Hz * 9 = 8.1kHz peak inbound rate?<div class="im"><br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

4) How many queues do messages end up in, on average?<br>

About the same number of bindings - 75.  We don&#39;t do many multiple bindings per queue (if any).<br>

</blockquote>

<br></div>

8.1kHz inbound with a 75x fan-out ratio would require an outbound rate of &gt;600kHz, which is way more than a two-node rabbit cluster can handle. So some backlog will certainly build up.<br>
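<br>
As a quick check of the arithmetic above, here is a minimal sketch that uses only the figures quoted in this thread (9 producers, 900 msgs/sec each at peak, about 75 queues per message). The variable names are illustrative and not taken from any RabbitMQ API.<br>

```python
# Figures come from the thread; the variable names are illustrative only.
producers = 9                   # "8-9 such producers", upper bound
peak_rate_per_producer = 900    # msgs/sec during burst catch-up
queues_per_message = 75         # approximate fan-out per message

# Peak rate of messages arriving at the cluster.
inbound = producers * peak_rate_per_producer

# Every inbound message is delivered to each matching queue.
outbound = inbound * queues_per_message

print(inbound, outbound)  # prints: 8100 607500
```

<br>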

<br>

It will take a while for messages to make it into queues. This isn&#39;t helped by lack of optimisation in two areas of the server code:<br>

<br>

- topic exchanges. As you know, they are currently totally unoptimised and the cost of determining the queues a message should be routed to is linear in the total number of bindings on the exchange. How many bindings are there in total in your case? If you can, please use a direct exchange.<br>


<br>

- cross-node routing in a cluster. A while ago we had to remove optimised cross-node routing since it turned out to break certain effect visibility guarantees required by AMQP. As a result, routing a message to N queues residing on a different node will result in N network transmissions of the message to that node, and N copies of the message at the node. If you can, don&#39;t use clustering or at least avoid configurations where producers and consumers connect to different nodes.<br>
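<br>
The two costs described above can be illustrated with a small sketch. This is not RabbitMQ's implementation; it is a hypothetical model that assumes standard AMQP topic semantics ("*" matches exactly one word, "#" matches zero or more words), checks every binding for every message, and counts one network transmission per matching queue on a remote node. All function names, queue names and node names are made up for illustration.<br>

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP-style topic pattern matches the routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings, routing_key):
    """Naive routing: test every binding, so cost is linear in len(bindings).

    bindings is a list of (pattern, queue, node) tuples (hypothetical layout).
    """
    return [(queue, node) for pattern, queue, node in bindings
            if topic_matches(pattern, routing_key)]


def cross_node_transmissions(matched, local_node):
    """One transmission per matched queue that lives on another node."""
    return sum(1 for _queue, node in matched if node != local_node)


# Hypothetical bindings spread over two nodes, "a" and "b".
bindings = [
    ("stock.*", "q1", "a"),
    ("stock.#", "q2", "b"),
    ("fx.usd", "q3", "b"),
    ("#", "q4", "b"),
]
matched = route(bindings, "stock.nyse")
remote_copies = cross_node_transmissions(matched, local_node="a")
```

In this toy setup a message published on node "a" matches three queues, two of which live on node "b", so the message body would cross the network twice. With roughly 75 bindings, each message costs roughly 75 pattern checks in this model, and the number of remote copies depends on where the matching queues are placed.<br>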


<br>

<br>

Regards,<br><font color="#888888">

<br>

Matthias.<br>

</font></blockquote></div><br>