[rabbitmq-discuss] Application architecture question: queue failure
Tim Watson
tim at rabbitmq.com
Mon Jun 18 18:29:34 BST 2012
On 18/06/2012 16:37, Bill Moseley wrote:
>
> Let me give you an example -- which is an actual workflow we have.
>
> In our web app a user can select to receive a report. In the web app we
> want the user to see that the report is indeed queued, so in the
> database we set a flag saying that the job was sent, and when. This
> allows us to display "pending" so the user doesn't submit the request
> multiple times.
>
So you already have some kind of identity, in order to uniquely identify
the job in the database.
> The web app queues the message for the background report generation.
> Anything is possible -- so imagine first that the message is somehow
> lost. The web app is still showing "pending" to the user.
>
Well that's true, anything *is* possible. Your entire data centre and
its resident RabbitMQ cluster nodes could all disappear off the face of
the earth, the hard drives no longer usable, and so on. Persistent
messages and HA do you no good against that kind of scenario; from a
disaster recovery standpoint, you definitely need a way to
re-synchronise the database and figure out which jobs have actually been
lost.
> But, we do want the task to complete -- it's a revenue generator, for
> example. So, one option is to use cron to look for stale "pending"
> request on the web side and assume the message was lost and re-queue.
> But, after X attempts maybe the cron job decides to give up.
>
It *sounds* to me like you need the housekeeping functions to set a
different completion status on the database record depending on how the
task was resolved.
> Now, this report generation actually uses a third-party web service, and
> this web service has gone down for extended periods for maintenance.
> So, in this case the report request jobs stack up in the queue.
>
> So, if it's down long enough then cron might run again and re-queue the
> same job that is already in the queue. What I have done for this is
> atomically change the state from "pending" to "in process" so that only
> one message gets processed. But, using some kind of UUID and a store is
> another option, of course.
>
I'm a bit confused now. Where do you set this 'in process' status: in
the database record, or in some field of the newly submitted message,
or in the originally submitted message(s)?
> Maybe you are right that durable queues are the correct solution for
> this -- I still need to track state on the web app side to show
> "pending" or "in process". And maybe just use cron to report/clean up
> any stale pending job on the web app side.
>
> I'm just curious if the above is a common design pattern when using
> RabbitMQ in this way. Obviously, depends on the specifics of the task,
> but we seem to have quite a few situations like this.
>
I still don't understand the difference between 'stale' and 'pending'.
Whether you do this based on timestamp or uuid or whatever, you need
*some* mechanism to avoid duplicating work. Because AMQP cannot reliably
do 'only once' delivery without consumer intervention, I would expect
that you need to track which jobs have been handled and which have not.
What I don't understand is how this pending/stale flag helps you, nor
why cron jobs are an attractive choice to deal with expiring messages.
It seems to me there are a few separate problem domains here, which are
getting tangled up in our discussion. I would posit that you need to
deal with:
1. Making sure a job/task has definitely been 'registered' with the system.
2. Indicating the outcome of (1) to your users
3. Avoiding 're-submitting' the same job/task many times
4. Dealing with failures in external services
Please feel free to correct that list or add to it or whatever.
When it comes to (1), as I mentioned durable queues with persistent
messages are the way to go. Once a message is 'on disk' then it is
reasonable for you to assume that the job is safely 'in the system' now.
I won't pretend that this represents a complete disaster recovery
solution, because we both know it does not. I do feel, however, that
such a solution involves *far* more technology than just your message
broker, so I'm going to gently push it out of scope for the purposes of
this discussion. :)
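To make (1) concrete, here's a minimal sketch using the Python `pika`
client (the queue name, message body and localhost broker are made-up
examples, not anything from your setup):

```python
import pika

# Connect to a local broker (assumed for illustration)
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Durable queue: the queue *definition* survives a broker restart
channel.queue_declare(queue='tasks', durable=True)

# Persistent message (delivery_mode=2): the message itself is written
# to disk, so it survives a restart too
channel.basic_publish(
    exchange='',
    routing_key='tasks',
    body='{"job_id": "1234"}',
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```

Note that both halves matter: a persistent message published to a
non-durable queue (or a transient message in a durable queue) won't
survive a restart.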
As for (2), in the absence of queue browsing, you are probably doing the
right thing already in terms of storing a record in your web
application's database to indicate that the job has indeed been
submitted (and is now in a pending state).
Your problem with 'duplicate tasks' appears to happen mainly because
your cron job 're-submits' the message. With a persistent queue, there
would be no need to do this at all, as the message is on disk and will
survive a broker crash (though it won't survive if your data centre
slips off a cliff into the ocean).
What I'd suggest is a slightly different approach. Set up your durable
task queue with a 'dead letter exchange' so that expiring messages (or
those rejected with `requeue=false`) will be shoved into that exchange.
Now set up the target (dead letter) exchange to publish to another
(durable) queue, let's call it 'redelivery', and make sure this is
configured to stay around even when there are no consumers.
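In `pika` terms, that topology might look something like this (all the
names and the 60 second TTL are arbitrary examples):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Exchange that will receive expired or rejected (requeue=false) messages
channel.exchange_declare(exchange='task-dlx', exchange_type='direct',
                         durable=True)

# Durable 'redelivery' queue, fed by the dead letter exchange. Dead
# lettered messages keep their original routing key ('tasks') by
# default, hence the binding key below.
channel.queue_declare(queue='redelivery', durable=True)
channel.queue_bind(queue='redelivery', exchange='task-dlx',
                   routing_key='tasks')

# The task queue itself: messages that expire (x-message-ttl) or are
# rejected with requeue=false get shoved into 'task-dlx'
channel.queue_declare(queue='tasks', durable=True, arguments={
    'x-dead-letter-exchange': 'task-dlx',
    'x-message-ttl': 60000,  # 60 seconds, for illustration only
})
connection.close()
```

You can also set the TTL per message (via the `expiration` property)
instead of per queue, if different jobs need different deadlines.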
Set up a 'permanent subscriber' to the 'redelivery' queue - i.e., have
an always running thread consuming these messages and make sure it is
restarted if it fails for any reason - and have this subscriber take
each arriving message and re-submit it to the original task queue.
Finally, when submitting jobs to the task queue, set the TTL to a
reasonable value (for your application's needs) and this is what will
happen:
1. you submit the task
2. the task TTL expires after the correct time lapse
3. the broker sends the 'expired' message to the 'dead letter exchange'
4. the exchange routes the message to the 'redelivery' queue
5. the redelivery queue re-submits the message (as a new message!) to
the task queue
6. a consumer (job) grabs the message before it expires this time
7. the job (process/thread/application) fails (due to an external
service error or whatever)
8. the job (process/thread/application) rejects the job with `requeue=false`
9. steps 3, 4 and 5 run again
10. eventually something good happens!?
Actually, to deal properly with (10), you probably want to keep some
kind of timestamp with the message; then, in the consumer that reads
the 'redelivery' queue and re-submits jobs, allow the message to time
out after too long and set an error flag in the database (or something).
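That 'give up eventually' decision is independent of the AMQP client,
so it can be sketched as a pure function (the one hour cut-off and the
header name are arbitrary examples):

```python
import time

MAX_AGE_SECONDS = 3600  # give up after an hour (hypothetical policy)

def should_resubmit(first_submitted, now=None, max_age=MAX_AGE_SECONDS):
    """Decide whether a dead-lettered task should go back onto the
    task queue, or be flagged as failed in the database instead.

    first_submitted -- unix timestamp carried in a message header
                       when the job was *first* published.
    """
    if now is None:
        now = time.time()
    return (now - first_submitted) < max_age

# In the redelivery consumer you would call this per message:
# re-publish to the task queue if True, otherwise update the database
# row to an error status and ack the message without republishing.
```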
If you *do* have some kind of external identity that holds for the
(conceptual) lifetime of the task, then you could store this original
timestamp in the database and query that against the task id, but
obviously you'll need to consider the potential performance (and
architectural) implications of doing that for yourself.
Step (8) might also be problematic if your tasks take a long time to
complete, so you may wish to rework that step in terms of re-submitting
instead of rejecting the message. As long as you have heartbeats
enabled, your consumer channel shouldn't be closed, but until you've
ack'ed the message one way or another, other consumers could 'get' it
and therefore you'll need to make them idempotent to deal with this.
Whatever you choose to do, the database needs to be properly updated
when a task does finally succeed. The fact that you *must* do this at
some point already (in order for the UI to be consistent) means you
already have a thread of identity, and therefore you should be able to
use this to create idempotent consumers where duplicate tasks are
potentially an issue.
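As a toy illustration of that idempotent claim, here's an in-memory
stand-in for your database table (in SQL the same transition would be
`UPDATE jobs SET status='in-process' WHERE id=? AND status='pending'`,
checked against the affected-row count):

```python
import threading

class JobStore:
    """In-memory stand-in for the web app's job table, showing the
    atomic pending -> in-process transition that makes consumers
    idempotent when duplicate messages arrive."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def submit(self, job_id):
        with self._lock:
            self._status.setdefault(job_id, 'pending')

    def try_claim(self, job_id):
        """Atomically move a job from 'pending' to 'in-process'.
        Exactly one claimant gets True; duplicates get False and
        should simply ack and drop their copy of the message."""
        with self._lock:
            if self._status.get(job_id) == 'pending':
                self._status[job_id] = 'in-process'
                return True
            return False

    def complete(self, job_id):
        with self._lock:
            self._status[job_id] = 'done'

    def status(self, job_id):
        with self._lock:
            return self._status.get(job_id)
```

A second consumer receiving a duplicate of the same task sees
`try_claim` return False and can safely discard it.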
> Oh, and with this example of the third-party web service another problem
> is knowing if a failure of this service is permanent or temporary. I
> have not done this, but I'm tempted to have my workers pull the jobs off
> the queue and if the job fails for an unclear reason then ack the
> original job and then send it to a "try again later" queue and have
> separate workers handle those.
>
That sounds like a reasonable approach. It is somewhat similar to the
approach I describe above.