[rabbitmq-discuss] Design Help?

Tue Feb 28 20:23:48 GMT 2012

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jerry,

Thanks for the suggestions. You're right that this is an abuse of the
messaging patterns, but the way our system is structured seems to
require building these higher level concepts on top of the primitives we
get from AMQP/RabbitMQ.

We have a system spread out over many servers, across datacenters, with
redundant applications running, and we need ways to coordinate work
amongst the redundant workers in this manner. When we don't care about
the order of work, its really easy to have the queue round-robin the
work to each worker.

As we're building more advanced features into our platform, this pattern
of only allowing one worker to process a job, across the whole platform,
is popping up in a few places and we'd like to encapsulate that pattern
into a reusable bit of code (a behaviour in Erlang).

AMQP is the glue that binds all these apps together, so we look to build
on top of its feature set.

I'll let the list know more if I figure a tenable solution :)

On 02/28/2012 11:53 AM, Jerry Kuch wrote:
> Hi, James:
> 
> Some of this appears to go beyond messaging per se and into more
> general distributed systems/services coordination.  Have you contemplated
> using something like Apache Zookeeper or Heroku's Doozer?  The primitives
> provided by those may lend themselves well to integrating-with/supporting
> your application, while the messaging heavy lifting continues using Rabbit
> and its features/guarantees in the ways they work best without strange
> fetch-peek-ack-nack trickery that will likely make your application 
> pretty complex and hard to reason about.
> 
> http://zookeeper.apache.org/
> 
> http://xph.us/2011/04/13/introducing-doozer.html
> 
> Doozer is quite new; Zookeeper by virtue of its position in the
> Hadoop ecology is older, more mature and well documented both online
> and in books...
> 
> Best regards,
> Jerry
> 
> ----- Original Message -----
> From: "James Aimonetti" <james at 2600hz.com>
> To: rabbitmq-discuss at lists.rabbitmq.com
> Sent: Monday, February 20, 2012 9:04:12 AM
> Subject: [rabbitmq-discuss] Design Help?
> 
> Hey list,
> 
> Curious for feedback, as this design idea I have would possibly extend
> to several places in our application, and I'd love to make it generic
> enough to do so.
> 
> The basic problem:
> 
> We have a queue of work to be done, represented as a named queue. For
> redundancy, we have 2->many processes (in Erlang) that will read from
> the queue and do the work. Easy so far.
> 
> However, we also have a list of resources needed to process the work
> queue, which needs to be shared across all worker processes. When one
> worker retrieves a job from the work queue, it must also get the list of
> available resources, match resource to job, and return the list of
> resources (with the used resource flagged as such) back to a globally
> accessible place (hoping to avoid the DB, but it may be a good-enough
> sync point).
> 
> So, for example, there are four jobs, two that require a paper shredder,
> two that require a wood chipper. There is only one paper shredder and
> one wood chipper. There are also 10 workers available to actually shred
> stuff.
> 
> Worker one consumes the list of available equipment (one paper shredder,
> one wood chipper), and then pulls the first job (paper to be shredded).
> Worker one grabs the paper shredder, returns the list of resources
> having flagged the paper shredder as in use, and ACKs the job. When
> worker one finishes shredding paper, it will wait until the resources
> list is available and update it with 1 more paper shredder available.
> 
> Once a job is actually started, I don't care if the worker crashes or
> not; that will be handled later. And I'll have a supervising gen_server
> pull the resource list, add the resource its using to its state, and
> spawn the actual worker (catching exits to return the resource). I'm
> sure other measures will be needed to ensure the resource is added back
> eventually, but I'm trying to get the basic idea down.
> 
> So once worker one is done retrieving a resource, worker two can next
> get the resources list and the next job. If the job is paper shredding,
> worker two will nack the job (as there are no paper shredders at the
> moment) and return the resource list. I will eventually want to support
> being able to requeue the job to the back of the job queue, but for now,
> work stalls until the first item in the queue can be processed successfully.
> 
> So while the resources list is out of queue, no workers will consume
> jobs from the work queue. As soon as it reappears in queue, I want the
> next worker in line to grab it and try to process the next job.
> 
> 
> Some initial concerns:
> 
> No resources available for the job at the front of the queue means a lot
> of fetching and nacking of jobs, and fetching/returning the resource list.
> 
> Losing the resources list: if whatever node (our Erlang VMs only
> communicate via AMQP) goes down while handling the resources list, it
> will obviously never be republished. We do usually have a timeout window
> for the worker to get a resource for a job (sometimes the actual finding
> of the resource takes a while), so if I could coordinate all workers
> consuming from the resources_available queue, I could know after timeout
> + a small fudge factor, that someone else should have begun processing.
> So maybe the resources_available queue is more a try_to_process_a_job
> queue, and each worker process maintains an identical list of resources
> available, with a worker that pairs a job to resource publishing the
> resource's in-use status.
> 
> So perhaps the solution is:
> 
> Jobs queue round robins amongst all workers, but only one worker can
> fetch/(n)ack. A worker can only fetch from the jobs queue by consuming a
> "next" message from a shared coordination queue. All workers publish
> updates to resources available to a routing key so all workers'
> resources state is in sync. If a worker can't find a resource for a job,
> it can (configurable) 1) nack the job and send a 'next' to the
> coordination queue; 2) requeue the job at the end of the jobs queue and
> send a 'next'; 3) sit on the job (meaning no other worker can fetch from
> the jobs queue) until an appropriate resource comes available.
> 
> Concerns here:
> 
> Starting/recovering the 'next' token.
> 
> I'm sure others but am realizing this is getting long.
> 
> Sorry for the wall of text and stream of consciousness writing. We have
> a lot of places where this pattern is showing up and would love to get a
> module or library written to ease doing this kind of coordination. Any
> ideas or directions towards appropriate messaging patterns warmly welcome :)
> 
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss at lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

- -- 
James Aimonetti
Distributed Systems Engineer / DJ MC_

2600hz | http://2600hz.com
sip:james at 2600hz.com
tel: 415.886.7905
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPTTfUAAoJENc77s1OYoGgdq0H/2X7oohyxx0MhsoD06J07l2P
DZm7PdEEZhUCSRGb43cBszNjiFLvYtmbSGgWUEzQMjZvI0idKC6m4XIgZqKfJ7Wk
yVQcvMfeHmE0w1NIyrJLFlXToONHlX2mLStRHnsmhFdMZuDHuAwF41N3rZPX5/wV
JIUIvYg45VccYu0zL4kRq/9nBORJsfXxXcsB8pdKJa0uLxv2OPptDOHFgZVz/wVe
FC99/fBU/WixtDvzhe2D7d+ubn5UT1jf01ZKUTAgkesM3EpcmqdqI51jQR2Q0hbT
VhVODLSlTSrYSYsumStXOaFzZ8kwY7+mT3UDbRqtv+ThIKuWWejCXGFVjDeKcA8=
=NhsO
-----END PGP SIGNATURE-----