[rabbitmq-discuss] Application architecture question: queue failure

Bill Moseley moseley at hank.org
Mon Jun 18 19:02:39 BST 2012


On Mon, Jun 18, 2012 at 12:29 PM, Tim Watson <tim at rabbitmq.com> wrote:

>
> I'm a bit confused now. Where do you set this 'in process' status - on the
> newly submitted message, or in the database record or in some field in the
> originally submitted message(s)?


The application database.  When the user says "I want a report" we flag
the start time and the state (initially "pending") and send off the
message.

The idea is the worker picks up the job and atomically sets it from
"pending" to "in process" -- which means even if the job was queued
multiple times only one process would pick up the actual work.

Then, when the job is completed, the state is changed from "in process"
to "completed".

>
>
>  Maybe you are right that durable queues are the correct solution for
>> this -- I still need to track state on the web app side to show
>> "pending" or "in process".   And maybe just use cron to report/clean up
>> any stale pending job on the web app side.
>>
>> I'm just curious if the above is a common design pattern when using
>> RabbitMQ in this way.  Obviously, depends on the specifics of the task,
>> but we seem to have quite a few situations like this.
>>
>>
> I still don't understand the difference between 'stale' and 'pending'.
> Whether you do this based on timestamp or uuid or whatever, you need *some*
> mechanism to avoid duplicating work. Because AMQP cannot reliably do 'only
> once' delivery without consumer intervention, I would expect that you need
> to track which jobs have been handled and which have not. What I don't
> understand is how this pending/stale flag helps you, nor why cron jobs are
> an attractive choice to deal with expiring messages.
>

Well, that's essentially my question.  Obviously, I want the web app to
know that a report request was made so it can display to the user that the
report is in the process of being generated.  And I also want to prevent
multiple submissions by a user for the same thing.  So, the database serves
this function.

The difficulty is when a job gets stuck in "pending".  At what point do we
give up or try again?

Thanks for your comments below.  I think the solution with the dead letter
exchange is the way to go, as it avoids using something like cron to handle
extra processing.  This way the task is always "in the system" in a
controlled way.

Then I won't over-engineer for the very rare chance of a failure.  I may
not even really need the durable queues if I can run a utility to resubmit
stuck "pending" jobs in those rare cases.


Thanks very much for your input.


> It seems to me there are a few separate problem domains here, which are
> getting tangled up in our discussion. I would posit that you need to deal
> with
>
> 1. Making sure a job/task has definitely been 'registered' with the system.
>
> 2. Indicating the outcome of (1) to your users
>
> 3. Avoiding 're-submitting' the same job/task many times
>
> 4. Dealing with failures in external services
>
> Please feel free to correct that list or add to it or whatever.
>
> When it comes to (1), as I mentioned durable queues with persistent
> messages are the way to go. Once a message is 'on disk' then it is
> reasonable for you to assume that the job is safely 'in the system' now. I
> won't pretend that this represents a complete disaster recovery solution,
> because we both know it does not. I do feel, however, that such a solution
> involves *far* more technology than just your message broker, so I'm going
> to gently push it out of scope for the purposes of this discussion. :)
>
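
Just to check my understanding, I take it the durable/persistent
combination looks something like this -- a minimal pika sketch, where the
'tasks' queue name is mine:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()

    # a durable queue survives a broker restart...
    channel.queue_declare(queue='tasks', durable=True)

    # ...and delivery_mode=2 marks the message persistent, so it is
    # written to disk and survives the restart as well
    channel.basic_publish(
        exchange='',
        routing_key='tasks',
        body='{"job_id": 42}',
        properties=pika.BasicProperties(delivery_mode=2),
    )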
> As for (2), in the absence of queue browsing, you are probably doing the
> right thing already in terms of storing a record in your web application's
> database to indicate that the job has indeed been submitted (and is now in
> a pending state).
>
> Your problem with 'duplicate tasks' appears to happen mainly because your
> cron job 're-submits' the message. With a persistent queue, there would be
> no need to do this at all, as the message is on disk and will survive a
> broker crash (though it won't survive if your data centre slips off a cliff
> into the ocean).
>
> What I'd suggest is a slightly different approach. Set up your durable
> task queue with a 'dead letter exchange' so that expiring messages (or
> those rejected with `requeue=false`) will be shoved into that exchange. Now
> set up the target (dead letter) exchange to publish to another (durable)
> queue, let's call it 'redelivery', and make sure this is configured to stay
> around even when there are no consumers.
>
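
If I follow, the declarations would be along these lines (a sketch; the
exchange and queue names are my own):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()

    # the dead letter exchange and the durable 'redelivery' queue behind it
    channel.exchange_declare(exchange='dlx', exchange_type='direct')
    channel.queue_declare(queue='redelivery', durable=True)
    # dead-lettered messages keep their original routing key ('tasks')
    channel.queue_bind(queue='redelivery', exchange='dlx', routing_key='tasks')

    # the durable task queue: expired messages, or those rejected with
    # requeue=False, are shoved into 'dlx'
    channel.queue_declare(
        queue='tasks',
        durable=True,
        arguments={'x-dead-letter-exchange': 'dlx'},
    )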
> Set up a 'permanent subscriber' to the 'redelivery' queue - i.e., have an
> always running thread consuming these messages and make sure it is
> restarted if it fails for any reason - and have this subscriber take each
> arriving message and re-submit it to the original task queue.
>
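
Continuing the sketch above, the subscriber would be roughly this (I come
back to giving up further down):

    def redeliver(ch, method, properties, body):
        # re-submit as a brand new message on the original task queue
        ch.basic_publish(
            exchange='',
            routing_key='tasks',
            body=body,
            properties=pika.BasicProperties(delivery_mode=2),
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue='redelivery', on_message_callback=redeliver)
    channel.start_consuming()  # supervise this process so it is restarted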
> Finally, when submitting jobs to the task queue, set the TTL to a
> reasonable value (for your application's needs) and this is what will
> happen:
>
> 1. you submit the task
> 2. the task TTL expires after the correct time lapse
> 3. the broker sends the 'expired' message to the 'dead letter exchange'
> 4. the exchange routes the message to the 'redelivery' queue
> 5. the redelivery queue re-submits the message (as a new message!) to the
> task queue
> 6. a consumer (job) grabs the message before it expires this time
> 7. the job (process/thread/application) fails (due to an external service
> error or whatever)
> 8. the job (process/thread/application) rejects the job with
> `requeue=false`
> 9. steps 3, 4 and 5 run again
> 10. eventually something good happens!?
>
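
In which case step 1 is just a publish with a per-message TTL -- the
ten-minute figure below is only an example value:

    # per-message TTL is given in milliseconds, as a string
    channel.basic_publish(
        exchange='',
        routing_key='tasks',
        body='{"job_id": 42}',
        properties=pika.BasicProperties(
            delivery_mode=2,      # persistent
            expiration='600000',  # ten minutes, then off to the DLX
        ),
    )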
> Actually, to deal properly with (10), you probably want to keep some kind
> of timestamp with the message, and in the consumer that is reading the
> 'redelivery' queue and re-submitting jobs, allow the message to time out
> and set an error flag in the database (or something).
>
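
So the redeliver callback above would grow a guard along these lines (the
'first_submitted' header, the 24-hour limit and mark_job_failed are all my
inventions):

    import time

    MAX_AGE_SECONDS = 24 * 60 * 60  # give up after a day

    def redeliver(ch, method, properties, body):
        headers = properties.headers or {}
        first_submitted = headers.get('first_submitted', time.time())
        if time.time() - first_submitted > MAX_AGE_SECONDS:
            mark_job_failed(body)  # hypothetical: set the error flag in the DB
            ch.basic_ack(delivery_tag=method.delivery_tag)
            return
        ch.basic_publish(
            exchange='',
            routing_key='tasks',
            body=body,
            properties=pika.BasicProperties(
                delivery_mode=2,
                expiration='600000',
                headers={'first_submitted': first_submitted},
            ),
        )
        ch.basic_ack(delivery_tag=method.delivery_tag)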
> If you *do* have some kind of external identity that holds for the
> (conceptual) lifetime of the task, then you could store this original
> timestamp in the database and query that against the task id, but obviously
> you'll need to consider the potential performance (and architectural)
> implications of doing that for yourself.
>
> Step (8) might also be problematic if your tasks take a long time to
> complete, so you may wish to rework that state in terms of re-submitting
> instead of rejecting the message. As long as you have heartbeats enabled,
> your consumer channel shouldn't be closed, but until you've ack'ed the
> message one way or another, other consumers could 'get' it and therefore
> you'll need to make them idempotent to deal with this.
>
> Whatever you choose to do, the database needs to be properly updated when
> a task does finally succeed. The fact that you *must* do this at some point
> already (in order for the UI to be consistent) means you already have a
> thread of identity, and therefore you should be able to use this to create
> idempotent consumers where duplicate tasks are potentially an issue.
>
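
Putting this together with the atomic claim I described earlier, I picture
the worker looking something like this (claim_job is the guarded UPDATE
from above; run_report, mark_job_pending and mark_job_completed are
hypothetical helpers of mine):

    import json

    def handle_task(ch, method, properties, body):
        job_id = json.loads(body)['job_id']
        # the atomic 'pending' -> 'in process' claim; False means a
        # duplicate delivery, so just ack and skip the work
        if not claim_job(db, job_id):
            ch.basic_ack(delivery_tag=method.delivery_tag)
            return
        try:
            run_report(job_id)
            mark_job_completed(db, job_id)  # 'in process' -> 'completed'
            ch.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            mark_job_pending(db, job_id)  # reset so a retry can claim it
            # requeue=False sends it to the dead letter exchange and
            # around through 'redelivery' again
            ch.basic_reject(delivery_tag=method.delivery_tag, requeue=False)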
>
>  Oh, and with this example of the third-party web service another problem
>> is knowing if a failure of this service is permanent or temporary.  I
>> have not done this, but I'm tempted to have my workers pull the jobs off
>> the queue and if the job fails for an unclear reason then ack the
>> original job and then send it to a "try again later" queue and have
>> separate workers handle those.
>>
>>
> That sounds like a reasonable approach. It is somewhat similar to the
> approach I describe above.
>



-- 
Bill Moseley
moseley at hank.org