It seems like one of the problems with round robin is that consumers may spend
more time on some messages than others, so you are depending on a random
distribution to even out the load.

To help with load balancing, could the consumers be set up so that, instead of
round robin, each simply tries to read from a common queue, and whoever gets
there first gets the message? This would mean that each consumer only receives
a message when it becomes idle, which seems like what would be wanted.
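That behaviour is roughly what AMQP's basic.qos prefetch setting gives you on a
shared queue: with a prefetch count of 1, the broker won't deliver a new message
to a consumer until it has acknowledged the one in flight, so busy consumers are
skipped over. A minimal sketch, assuming the Python pika client (1.x API) and a
queue name invented for illustration:

    import pika

    # Connect to a local broker; adjust host/credentials as needed.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    channel.queue_declare(queue="work")  # "work" is an assumed name

    # prefetch_count=1: the broker holds back further messages until this
    # consumer acks the current one, so idle consumers get the next message.
    channel.basic_qos(prefetch_count=1)

    def process(body):
        print("working on", body)  # hypothetical application-level work

    def handle(ch, method, properties, body):
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="work", on_message_callback=handle)
    channel.start_consuming()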
On the producer side, if there were multiple queues, the producer would want to
write to the queue with the fewest messages on it.

I'm trying to learn AMQP too, and this has been an interesting discussion to watch.

Thanks,

- Jim

Jim Irrer     irrer@umich.edu     (734) 647-4409
University of Michigan Hospital Radiation Oncology
519 W. William St.     Ann Arbor, MI 48103
On Tue, Aug 18, 2009 at 9:18 AM, Paul Dix <paul@pauldix.net> wrote:
All of that makes sense.

Let me give some more specifics about what I'm building and how I'm
hoping to use the messaging system. I'm doing a constant internet
crawl of sorts; Twitter updates and everything else are in there. When
something gets pulled down, the document gets inserted into a
horizontally scalable key-value store in the sky. I then want to send
a message through the system saying that this key/value has been
inserted/updated. This is being done by 20-100 boxes.

I then want that message to be grabbed by a consumer, where some
processing will happen and probably some ranking, relevance, and other
things get written to an index somewhere (also being done by a large
number of boxes).

So for this specific case I'm using a direct exchange with a single
queue (no message persistence, and ordering doesn't matter). Hundreds
of producers are posting messages to the exchange with the same
routing key, and hundreds of consumers are pulling off the queue.
It's the firehose thing. Each message has to be processed once by any
one of the hundreds of consumers.
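For reference, that firehose topology is only a few declarations. A sketch of
the producer side, again assuming pika; the exchange, queue, and routing key
names are made up for illustration:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # One direct exchange, one shared non-durable queue, one routing key.
    channel.exchange_declare(exchange="crawl", exchange_type="direct")
    channel.queue_declare(queue="firehose", durable=False)  # no persistence
    channel.queue_bind(queue="firehose", exchange="crawl", routing_key="updates")

    # Every producer publishes with the same key; every consumer subscribed
    # to "firehose" competes for the messages.
    channel.basic_publish(exchange="crawl",
                          routing_key="updates",
                          body="key=doc123 op=updated")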
I guess I was hoping for the flow management part to be handled by
Rabbit. It looks to me like if I want to scale past the ingress
capabilities of one queue or exchange, I have to manage that on the
producer and consumer side.

I can create multiple exchanges and bind them to the same queue if the
routing becomes the bottleneck, but then the producers need to round
robin between the exchanges.

I can create multiple queues bound with different routing keys (flow1,
flow2) if the queue becomes the bottleneck, but then the producer
needs to know to round robin between the different routing keys, and
the consumers need to check both queues.
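Checking both queues from one consumer is just two subscriptions on the same
channel. A sketch, with queue names assumed to match the flow1/flow2 bindings:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    def handle(ch, method, properties, body):
        print(method.routing_key, body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # One consumer draining both halves of the split flow.
    for queue in ("flow1", "flow2"):
        channel.queue_declare(queue=queue)
        channel.basic_consume(queue=queue, on_message_callback=handle)

    channel.start_consuming()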
So in essence, when I mentioned scalability, it was a reference to
being able to transparently scale the messaging system to multiple
boxes. More specifically, I want my hundreds of producers to post
messages to a single exchange with a single routing key. I want my
hundreds of consumers to be able to consume messages off a single
queue. I want the exchange and the queue to be scalable (in the
multi-box, multi-process sense), where the messaging system handles it.
I want the messaging system to be scalable like the key/value store is
scalable: transparently, across many boxes.

There's really only one part of my system that has this requirement.
There are plenty of other places where I'll use messaging and not
have these kinds of insane needs. As I work more with the system it's
likely that I'll want to use more complex routing logic. It's possible
I'll want to break updates from different domains into separate message flows.

Thank you very much for being so helpful. Sorry for the lengthy response.
Paul
On Tue, Aug 18, 2009 at 4:20 AM, Alexis Richardson <alexis.richardson@gmail.com> wrote:
> Paul,
>
> On Mon, Aug 17, 2009 at 8:36 PM, Paul Dix <paul@pauldix.net> wrote:
>> Yeah, that's what I'm talking about. There will probably be upwards of
>> a few hundred producers and a few hundred consumers.
>
> Cool.
>
> So one question you need to answer is: do you want all the consumers
> to receive the same messages? I.e.:
>
> * are you aggregating all the producers into one 'firehose', and then
> sending the whole firehose on to all connected consumers?
>
> OR
>
> * are you planning to in some way share messages out amongst connected
> consumers, e.g. on a round robin basis?
>
> See more below re flow1, flow2...
>
>
>> The total ingress
>> is definitely what I'm most worried about right now.
>
> OK.
>
> Be aware that in high ingress rate cases you may be limited by the
> client egress rate, which is strongly implementation and platform
> dependent. Also, see Matthias' notes on testing performance, which
> are googleable from the rabbitmq archives, if you want to run some
> test cases at any point.
>
>
>> Later, memory may
>> be a concern, but hopefully the consumers are pulling so quickly that
>> the queue never gets extremely large.
>
> Yep.
>
>
>> Can you give me more specific details (or a pointer) to how the flow1,
>> flow2 thing works (both producer and consumer side)?
>
> Sure.
>
> First you need to read up on what 'direct exchanges' are and how they
> work in AMQP. I recommend Jason's intro to get you started:
>
> http://blogs.digitar.com/jjww/2009/01/rabbits-and-warrens/
>
> More background info can be found here: http://www.rabbitmq.com/how
>
> In a nutshell, RabbitMQ will route any message it receives on to one
> or more queues.
>
> Each queue lives on a node, and nodes are members of a cluster. You
> can have one or more nodes per machine - a good guide is to have one
> per core. You can send messages to any node in the cluster and they
> will get routed to the right places (adding more nodes to a cluster is
> how you scale ingress and availability).
>
> The routing model is based on message routing keys: queues receive
> messages whose routing keys match routing patterns ("bindings"). Note
> that multiple queues can request messages matching the same key,
> giving you 1-many pubsub. This is explained in Jason's article. I
> suggest you use the 'direct exchange' routing model, in which each
> message has one routing key, e.g. "flow1" or "flow2".
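To make the 1-many point concrete: if two queues are bound to a direct exchange
with the same key, each message published with that key is routed to both. A
sketch under the same pika assumption, with all names invented:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    channel.exchange_declare(exchange="flows", exchange_type="direct")

    # Two queues bound with the same routing key: each gets its own copy
    # of every "flow1" message (1-many pubsub).
    for queue in ("indexer", "ranker"):
        channel.queue_declare(queue=queue)
        channel.queue_bind(queue=queue, exchange="flows", routing_key="flow1")

    channel.basic_publish(exchange="flows", routing_key="flow1",
                          body="both queues receive this")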
><br>
> Take a look at the article and let us know if it all makes sense.<br>
><br>
> alexis<br>
><br>
><br>
>> Thanks,<br>
>> Paul<br>
>><br>
>> On Mon, Aug 17, 2009 at 2:32 PM, Alexis<br>
>> Richardson<<a href="mailto:alexis.richardson@gmail.com">alexis.richardson@gmail.com</a>> wrote:<br>
>>> On Mon, Aug 17, 2009 at 5:22 PM, Paul Dix<<a href="mailto:paul@pauldix.net">paul@pauldix.net</a>> wrote:<br>
>>>> So what exactly does option 1 look like?
>>>>
>>>> It sounds like it's possible to have a queue with the same id on two
>>>> different nodes bound to the same exchange.
>>>
>>> Not quite. Same routing - two queues, two ids. Actually, now that I
>>> think about it, that won't give you exactly what you need. More below.
>>>
>>>
>>>> Will the exchange then
>>>> round robin the messages to the two different queues? If so,
>>>> that's exactly what I'm looking for. I don't really care about order
>>>> on this queue.
>>>
>>> No it won't, and that's why my suggestion was wrong.
>>>
>>> Round robin does occur when you have two consumers (clients) connected
>>> to one queue. This WILL help you by draining the queue faster, if
>>> memory is a limitation.
>>>
>>> If total ingress is the limitation, you can increase it by splitting
>>> the flow. Suppose you start with one queue bound once to one exchange
>>> with key "flow1". Then all messages with routing key flow1 will go to
>>> that queue. When load is heavy, add a queue with key "flow2" on a
>>> second node. Then alternate (or, if you prefer, choose randomly) between
>>> routing keys flow1 and flow2. This will spread the load as you
>>> require. And so on, for more queues.
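The producer side of that scheme is a few lines: cycle through the routing keys
as you publish. A sketch, assuming the keys match queues already bound to an
exchange named "flows" (names invented as before):

    import itertools
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # Alternate between the flows; add "flow3", "flow4", ... as load grows.
    keys = itertools.cycle(["flow1", "flow2"])

    for document in ["doc1", "doc2", "doc3", "doc4"]:
        channel.basic_publish(exchange="flows",
                              routing_key=next(keys),
                              body=document)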
>>>
>>> You can make this part of a load balancing layer on the server side,
>>> so that clients don't have to be coded too much.
>>>
>>> Is this along the lines of what you need? Let me know, and I can elaborate.
>>>
>>> alexis
>>>
>>>> Thanks,
>>>> Paul
>>>>
>>>> On Mon, Aug 17, 2009 at 10:55 AM, Alexis Richardson <alexis.richardson@gmail.com> wrote:
>>>>> Paul
>>>>>
>>>>> On Mon, Aug 17, 2009 at 3:34 PM, Paul Dix <paul@pauldix.net> wrote:
>>>>>> Sorry for the confusion. I mean scalability on a single queue. Say I
>>>>>> want to push 20k messages per second through a single queue. If a
>>>>>> single node can't handle that, it seems I'm out of luck. That is, if
>>>>>> I'm understanding how things work.
>>>>>
>>>>> You can in principle just add more nodes to the cluster. More details below.
>>>>>
>>>>>
>>>>>> So I guess I'm not worried about total queue size, but queue
>>>>>> throughput (although size may become an issue, I'm not sure). It seems
>>>>>> the solution is to split out across multiple queues, but I was hoping
>>>>>> to avoid that since it will add a layer of complexity to my producers
>>>>>> and consumers.
>>>>>
>>>>> 1. To maximise throughput, don't use persistence. To make it bigger,
>>>>> forget about ordering. For example, you can easily have two
>>>>> queues, one per node, subscribed to the same direct exchange with the
>>>>> same key, and you ought to double throughput (all other
>>>>> things being equal and fair).
>>>>>
>>>>> 2. If you want to be both fast and 'reliable' (no loss of acked
>>>>> messages), then add more queues, make them durable, and set
>>>>> messages to be persistent.
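Point 2 corresponds to two flags: declare the queue durable, and mark each
message persistent (delivery mode 2 in AMQP). A sketch with pika, queue name
assumed:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # A durable queue survives a broker restart...
    channel.queue_declare(queue="flow1", durable=True)

    # ...and delivery_mode=2 asks the broker to write the message to disk.
    channel.basic_publish(
        exchange="",            # default exchange routes straight to "flow1"
        routing_key="flow1",
        body="important update",
        properties=pika.BasicProperties(delivery_mode=2),
    )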
>>>>>
>>>>> 3. If you want to preserve ordering, label each message with an ID and
>>>>> dedup at the endpoints. This does, as you say, add some small noise to
>>>>> your producers and consumers, but the above two options, 1 and 2, do
>>>>> not.
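One way to read point 3: the producer stamps each message with an ID (e.g. via
BasicProperties(message_id=...)) and consumers drop anything already seen. A toy
sketch of the consumer-side dedup; a real system would bound or persist the
seen-set, and everything here is assumed rather than taken from the thread:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="flow1")

    seen = set()  # toy in-memory dedup: unbounded, per-process only

    def handle(ch, method, properties, body):
        msg_id = properties.message_id  # set by the producer
        if msg_id not in seen:
            seen.add(msg_id)
            print("processing", msg_id, body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="flow1", on_message_callback=handle)
    channel.start_consuming()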
>>>>>
>>>>>
>>>>>> I don't think I understand how using Linux-HA with clustering would
>>>>>> lead to splitting a single queue across multiple nodes. I'm not
>>>>>> familiar with HA, but it looked like a solution for providing
>>>>>> replicated failover.
>>>>>
>>>>> You are right that HA techniques, indeed any kind of queue replication
>>>>> or replicated failover, will not help you here.
>>>>>
>>>>> What you want is 'flow over', i.e. "when load is high, make a new node
>>>>> with the same routing info". This is certainly doable.
>>>>>
>>>>> alexis
>>>>>
>>>>>
>>>>>> Thanks again,
>>>>>> Paul
>>>>>>
>>>>>> On Mon, Aug 17, 2009 at 10:24 AM, Tony Garnock-Jones <tonyg@lshift.net> wrote:
>>>>>>> Paul Dix wrote:
>>>>>>>> Do you have a roadmap for when a scalable queue
>>>>>>>> will be available?
>>>>>>>
>>>>>>> If by "scalable" you mean "replicated", then that's available now, by
>>>>>>> configuration along the lines I hinted at in my previous message. Adding
>>>>>>> clustering into the mix can help increase capacity on top of that (at a
>>>>>>> certain cost in configuration complexity).
>>>>>>>
>>>>>>> If instead you mean "exceeding RAM+swap size", we're hoping to have that
>>>>>>> for the 1.7 release -- which ought to be out within a month or so.
>>>>>>>
>>>>>>>> Just to give you a little more information on what I'm doing: I'm
>>>>>>>> building a live search/aggregation system. I'm hoping to push updates
>>>>>>>> of a constant internet crawl through the messaging system so workers
>>>>>>>> can analyze the content and build indexes as everything comes in.
>>>>>>>
>>>>>>> Sounds pretty cool!
>>>>>>>
>>>>>>> Tony
>>>>>>> --
>>>>>>> [][][] Tony Garnock-Jones    | Mob: +44 (0)7905 974 211
>>>>>>>   [][] LShift Ltd            | Tel: +44 (0)20 7729 7060
>>>>>>>  [] [] http://www.lshift.net/ | Email: tonyg@lshift.net