[rabbitmq-discuss] questions about distributed queue

Alexis Richardson alexis.richardson at gmail.com
Wed Aug 19 14:06:12 BST 2009


Jim,

On Tue, Aug 18, 2009 at 2:47 PM, Jim Irrer <irrer at umich.edu> wrote:
>
> To help with load balancing, could the consumers be set up so that,
> instead of round robin, each simply tries to read from a common queue,
> and whoever gets there first gets the message? This would mean that each
> consumer only gets a message when it becomes idle, which seems like what
> would be wanted.

That's right.  If you attach N consumers to one queue, they will treat
it as a shared resource: each message is delivered to exactly one of
them, round-robined across the consumers by default.
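
The exact behaviour you describe - a consumer only taking a new message
once it has finished with its current one - is what basic.qos with a
prefetch count of 1 gives you.  A minimal sketch with the Python pika
client (pika 1.x API; the queue name and host are invented):

    import pika

    # Connect and attach to the shared queue.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='work')

    # prefetch_count=1: the broker won't send this consumer another
    # message until it has acked the one in hand, i.e. work is only
    # dispatched to idle consumers.
    channel.basic_qos(prefetch_count=1)

    def handle(ch, method, properties, body):
        # ... process the message here ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue='work', on_message_callback=handle)
    channel.start_consuming()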

alexis




> On the producer side, if there were multiple queues, the producer would
> want to write to the queue with the fewest messages on it.
>
> I'm trying to learn AMQP too and this has been an interesting discussion to
> watch.
>
> Thanks,
>
> - Jim
>
> Jim Irrer     irrer at umich.edu       (734) 647-4409
> University of Michigan Hospital Radiation Oncology
> 519 W. William St.             Ann Arbor, MI 48103
>
>
> On Tue, Aug 18, 2009 at 9:18 AM, Paul Dix <paul at pauldix.net> wrote:
>>
>> All of that makes sense.
>>
>> Let me give some more specifics about what I'm building and how I'm
>> hoping to use the messaging system. I'm doing a constant internet
>> crawl of sorts; twitter updates and everything else are in there. So
>> when something gets pulled down, the document gets inserted into a
>> horizontally scalable key-value store in the sky. I then want to send
>> a message through the system that this key/value has been
>> inserted/updated. This is being done by 20-100 boxes.
>>
>> I then want that message to be grabbed by a consumer where some
>> processing will happen and probably some ranking, relevance and other
>> things get written to an index somewhere (also being done by a large
>> number of boxes).
>>
>> So for this specific case I'm using a direct exchange with a single
>> queue (no message persistence and don't bother keeping ordering).
>> Hundreds of producers are posting messages to the exchange with the
>> same routing key and hundreds of consumers are pulling off the queue.
>> It's the firehose thing. Each message has to be processed once by any
>> one of the hundreds of consumers.
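>>
>> Roughly, the topology I have in mind, as a sketch with the Python pika
>> client (the exchange, queue and key names are just for illustration):
>>
>>     import pika
>>
>>     connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>>     channel = connection.channel()
>>
>>     # One direct exchange, one shared queue, one routing key.
>>     channel.exchange_declare(exchange='crawl', exchange_type='direct')
>>     channel.queue_declare(queue='firehose')
>>     channel.queue_bind(exchange='crawl', queue='firehose',
>>                        routing_key='updates')
>>
>>     # Hundreds of producers all publish with the same key; hundreds of
>>     # consumers pull off the one queue.
>>     channel.basic_publish(exchange='crawl', routing_key='updates',
>>                           body='doc key inserted/updated')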
>>
>> I guess I was hoping for the flow management part to be handled by
>> Rabbit. It looks to me that if I want to scale past the ingress
>> capabilities of one queue or exchange I have to manage that on the
>> producer and consumer side.
>>
>> I can create multiple exchanges and bind to the same queue if the
>> routing becomes the bottleneck, but then the producers need to round
>> robin between the exchanges.
>>
>> I can create multiple queues bound with different routing keys (flow1,
>> flow2) if the queue becomes the bottleneck, but then the producer
>> needs to know to round robin to the different routing keys and the
>> consumers need to check both queues.
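>>
>> Concretely, the extra bookkeeping on my side would look something like
>> this (a sketch, hypothetical queue names, 'crawl' exchange as above):
>>
>>     import itertools
>>     import pika
>>
>>     connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>>     channel = connection.channel()
>>
>>     # Producer side: alternate routing keys so the two queues share
>>     # the load.
>>     keys = itertools.cycle(['flow1', 'flow2'])
>>     def publish(body):
>>         channel.basic_publish(exchange='crawl',
>>                               routing_key=next(keys), body=body)
>>
>>     # Consumer side: every consumer has to subscribe to both queues.
>>     def handle(ch, method, properties, body):
>>         # ... process ...
>>         ch.basic_ack(delivery_tag=method.delivery_tag)
>>
>>     for q in ('queue_flow1', 'queue_flow2'):
>>         channel.basic_consume(queue=q, on_message_callback=handle)
>>     channel.start_consuming()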
>>
>> So in essence, when I mentioned scalability, it was a reference to
>> being able to transparently scale the messaging system to multiple
>> boxes. And more specifically, I want my hundreds of producers to post
>> messages to a single exchange with a single routing key. I want my
>> hundreds of consumers to be able to consume messages off a single
>> queue. I want the exchange and the queue to be scalable (in the
>> multi-box, multi-process sense) where the messaging system handles it.
>> I want the messaging system to be scalable like the key/value store is
>> scalable. Transparently across many boxes.
>>
>> There's really only one part of my system that has this requirement.
>> There are plenty of other aspects in which I'll use messaging and not
>> have these kinds of insane needs. As I work more with the system it's
>> likely that I'll want to use more complex routing logic. It's possible
>> I'll want to break updates from domains into separate message flows.
>>
>> Thank you very much for being so helpful. Sorry for the lengthy response.
>> Paul
>>
>> On Tue, Aug 18, 2009 at 4:20 AM, Alexis Richardson <alexis.richardson at gmail.com> wrote:
>> > Paul,
>> >
>> >> On Mon, Aug 17, 2009 at 8:36 PM, Paul Dix <paul at pauldix.net> wrote:
>> >> Yeah, that's what I'm talking about. There will probably be upwards of
>> >> a few hundred producers and a few hundred consumers.
>> >
>> > Cool.
>> >
>> > So one question you need to answer is: do you want all the consumers
>> > to receive the same messages?  I.e.:
>> >
>> > * are you aggregating all the producers into one 'firehose', and then
>> > sending the whole firehose on to all connected consumers?
>> >
>> > OR
>> >
>> > * are you planning to share messages out in some way amongst connected
>> > consumers, e.g. on a round-robin basis?
>> >
>> > See more below re flow1, flow2...
>> >
>> >
>> >> The total ingress
>> >> is definitely what I'm most worried about right now.
>> >
>> > OK.
>> >
>> > Be aware that in high ingress rate cases you may be limited by the
>> > client egress rate, which is strongly implementation and platform
>> > dependent.  Also, see Matthias' notes on testing performance, which
>> > are googleable from the rabbitmq archives, if you want to run some
>> > test cases at any point.
>> >
>> >
>> >
>> >> Later, memory may
>> >> be a concern, but hopefully the consumers are pulling so quickly that
>> >> the queue never gets extremely large.
>> >
>> > Yep.
>> >
>> >
>> >> Can you give me more specific details (or a pointer) to how the flow1,
>> >> flow2 thing work (both producer and consumer side)?
>> >
>> > Sure.
>> >
>> > First you need to read up on what 'direct exchanges' are and how they
>> > work in AMQP.  I recommend Jason's intro to get you started:
>> >
>> > http://blogs.digitar.com/jjww/2009/01/rabbits-and-warrens/
>> >
>> > More background info can be found here: www.rabbitmq.com/how
>> >
>> > In a nutshell, RabbitMQ will route any message it receives on to one
>> > or more queues.
>> >
>> > Each queue lives on a node, and nodes are members of a cluster.  You
>> > can have one or more nodes per machine - a good guide is to have one
>> > per core.  You can send messages to any node in the cluster and they
>> > will get routed to the right places (adding more nodes to a cluster is
>> > how you scale ingress and availability).
>> >
>> > The routing model is based on message routing keys: queues receive
>> > messages whose routing keys match routing patterns ("bindings").  Note
>> > that multiple queues can request messages matching the same key,
>> > giving you 1-many pubsub.  This is explained in Jason's article.  I
>> > suggest you use the 'direct exchange' routing model, in which each
>> > message has one routing key, e.g.: "flow1", "flow2".
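>> >
>> > For example (a sketch with the Python pika client, names invented),
>> > binding two queues with the same key gives each queue its own copy
>> > of every matching message:
>> >
>> >     import pika
>> >
>> >     connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>> >     channel = connection.channel()
>> >     channel.exchange_declare(exchange='crawl', exchange_type='direct')
>> >
>> >     # 1-many pubsub: both queues receive every message published
>> >     # with routing_key='flow1'.
>> >     for q in ('index_a', 'index_b'):
>> >         channel.queue_declare(queue=q)
>> >         channel.queue_bind(exchange='crawl', queue=q,
>> >                            routing_key='flow1')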
>> >
>> > Take a look at the article and let us know if it all makes sense.
>> >
>> > alexis
>> >
>> >
>> >> Thanks,
>> >> Paul
>> >>
>> >> On Mon, Aug 17, 2009 at 2:32 PM, Alexis
>> >> Richardson<alexis.richardson at gmail.com> wrote:
>> >>> On Mon, Aug 17, 2009 at 5:22 PM, Paul Dix <paul at pauldix.net> wrote:
>> >>>> So what exactly does option 1 look like?
>> >>>>
>> >>>> It sounds like it's possible to have a queue with the same id on two
>> >>>> different nodes bound to the same exchange.
>> >>>
>> >>> Not quite.  Same routing - two queues, two ids.  Actually, now that I
>> >>> think about it, that won't give you exactly what you need.  More below.
>> >>>
>> >>>
>> >>>> Will the exchange
>> >>>> then round-robin the messages to the two different queues? If so,
>> >>>> that's exactly what I'm looking for. I don't really care about order
>> >>>> on this queue.
>> >>>
>> >>> No it won't and that's why my suggestion was wrong.
>> >>>
>> >>> Round robin does occur when you have two consumers (clients) connected
>> >>> to one queue.  This WILL help you by draining the queue faster, if
>> >>> memory is a limitation.
>> >>>
>> >>> If total ingress is the limitation you can increase that by splitting
>> >>> the flow.  Suppose you start with one queue bound once to one exchange
>> >>> with key "flow1".  Then all messages with routing key flow1 will go to
>> >>> that queue.  When load is heavy, add a queue with key "flow2", on a
>> >>> second node.  Then, alternate (if you prefer, randomly) between
>> >>> routing keys flow1 and flow2.  This will spread the load as you
>> >>> require.  And so on, for more queues.
>> >>>
>> >>> You can make this part of a load balancing layer on the server side,
>> >>> so that clients don't have to be coded too much.
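>> >>>
>> >>> In outline (a sketch with the Python pika client, hypothetical
>> >>> names):
>> >>>
>> >>>     import random
>> >>>     import pika
>> >>>
>> >>>     connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>> >>>     channel = connection.channel()
>> >>>     channel.exchange_declare(exchange='crawl', exchange_type='direct')
>> >>>
>> >>>     # One queue per flow; declare 'q_flow2' (ideally on a second
>> >>>     # node) when load gets heavy, and add its key to the list.
>> >>>     flows = ['flow1', 'flow2']
>> >>>     for f in flows:
>> >>>         channel.queue_declare(queue='q_' + f)
>> >>>         channel.queue_bind(exchange='crawl', queue='q_' + f,
>> >>>                            routing_key=f)
>> >>>
>> >>>     # Producers pick a key at random (or round robin) per message.
>> >>>     channel.basic_publish(exchange='crawl',
>> >>>                           routing_key=random.choice(flows),
>> >>>                           body='a message')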
>> >>>
>> >>> Is this along the lines of what you need?  Let me know, and I can
>> >>> elaborate.
>> >>>
>> >>> alexis
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>> Thanks,
>> >>>> Paul
>> >>>>
>> >>>> On Mon, Aug 17, 2009 at 10:55 AM, Alexis Richardson <alexis.richardson at gmail.com> wrote:
>> >>>>> Paul
>> >>>>>
>> >>>>> On Mon, Aug 17, 2009 at 3:34 PM, Paul Dix <paul at pauldix.net> wrote:
>> >>>>>> Sorry for the confusion. I mean scalability on a single queue. Say
>> >>>>>> I want to push 20k messages per second through a single queue. If
>> >>>>>> a single node can't handle that it seems I'm out of luck. That is,
>> >>>>>> if I'm understanding how things work.
>> >>>>>
>> >>>>> You can in principle just add more nodes to the cluster.  More
>> >>>>> details below.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> So I guess I'm not worried about total queue size, but queue
>> >>>>>> throughput (although size may become an issue, I'm not sure). It
>> >>>>>> seems the solution is to split out across multiple queues, but I
>> >>>>>> was hoping to avoid that since it will add a layer of complexity
>> >>>>>> to my producers and consumers.
>> >>>>>
>> >>>>> 1. To maximise throughput, don't use persistence.  To increase it
>> >>>>> further, forget about ordering.  So for example, you can easily
>> >>>>> have two queues, one per node, subscribed to the same direct
>> >>>>> exchange with the same key, and you ought to double throughput
>> >>>>> (assuming all other things being equal and fair).
>> >>>>>
>> >>>>> 2. If you want to be both fast and 'reliable' (no loss of acked
>> >>>>> messages), then add more queues and make them durable, and set
>> >>>>> messages to be persistent.
>> >>>>>
>> >>>>> 3. If you want to preserve ordering, label each message with an ID
>> >>>>> and dedup at the endpoints.  This does, as you say, add some small
>> >>>>> noise to your producers and consumers, but options 1 and 2 above
>> >>>>> do not.
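>> >>>>>
>> >>>>> For options 2 and 3, for example, the producer side is just this
>> >>>>> (a pika sketch, names invented):
>> >>>>>
>> >>>>>     import pika
>> >>>>>
>> >>>>>     connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>> >>>>>     channel = connection.channel()
>> >>>>>
>> >>>>>     # Option 2: durable queue + persistent messages, so acked
>> >>>>>     # messages survive a broker restart.
>> >>>>>     channel.queue_declare(queue='q_flow1', durable=True)
>> >>>>>     channel.basic_publish(
>> >>>>>         exchange='crawl', routing_key='flow1', body='a message',
>> >>>>>         properties=pika.BasicProperties(
>> >>>>>             delivery_mode=2,     # persistent
>> >>>>>             message_id='12345')) # option 3: ID for endpoint dedup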
>> >>>>>
>> >>>>>
>> >>>>>> I don't think I understand how using Linux-HA with clustering would
>> >>>>>> lead to splitting a single queue across multiple nodes. I'm not
>> >>>>>> familiar with HA, but it looked like it was a solution to provide a
>> >>>>>> replicated failover.
>> >>>>>
>> >>>>> You are right that HA techniques, indeed any kind of queue
>> >>>>> replication or replicated failover, will not help you here.
>> >>>>>
>> >>>>> What you want is 'flow over', i.e. "when load is high, make a new node
>> >>>>> with the same routing info".  This is certainly doable.
>> >>>>>
>> >>>>> alexis
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> Thanks again,
>> >>>>>> Paul
>> >>>>>>
>> >>>>>> On Mon, Aug 17, 2009 at 10:24 AM, Tony Garnock-Jones <tonyg at lshift.net> wrote:
>> >>>>>>> Paul Dix wrote:
>> >>>>>>>> Do you have a roadmap for when a scalable queue
>> >>>>>>>> will be available?
>> >>>>>>>
>> >>>>>>> If by "scalable" you mean "replicated", then that's available now,
>> >>>>>>> by
>> >>>>>>> configuration along the lines I hinted at in my previous message.
>> >>>>>>> Adding
>> >>>>>>> clustering into the mix can help increase capacity, on top of that
>> >>>>>>> (at a
>> >>>>>>> certain cost in configuration complexity).
>> >>>>>>>
>> >>>>>>> If instead you mean "exceeding RAM+swap size", we're hoping to
>> >>>>>>> have that for the 1.7 release -- which ought to be out within a
>> >>>>>>> month or so.
>> >>>>>>>
>> >>>>>>>> Just to give you a little more information on what I'm doing, I'm
>> >>>>>>>> building a live search/aggregation system. I'm hoping to push
>> >>>>>>>> updates of a constant internet crawl through the messaging system
>> >>>>>>>> so workers can analyze the content and build indexes as
>> >>>>>>>> everything comes in.
>> >>>>>>>
>> >>>>>>> Sounds pretty cool!
>> >>>>>>>
>> >>>>>>> Tony
>> >>>>>>> --
>> >>>>>>>  [][][] Tony Garnock-Jones     | Mob: +44 (0)7905 974 211
>> >>>>>>>   [][] LShift Ltd             | Tel: +44 (0)20 7729 7060
>> >>>>>>>  []  [] http://www.lshift.net/ | Email: tonyg at lshift.net
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>



