[rabbitmq-discuss] Application architecture question: queue failure

Mon Jun 18 18:29:34 BST 2012

On 18/06/2012 16:37, Bill Moseley wrote:
>
> Let me give you an example -- which is an actual workflow we have.
>
> In our web app a user can select to receive a report.  In the web app we
> want the user to see that the report is indeed queued, so in the
> database we set a flag saying that the job was sent, and when.   This
> allows us to display "pending" so the user doesn't submit the request
> multiple times.
>

So you already have some kind of identity, in order to uniquely identify 
the job in the database.

> The web app queues the message for the background report generation.
> Anything is possible -- so imagine first that the message is somehow
> lost.  The web app is still showing "pending" to the user.
>

Well that's true, anything *is* possible. Your entire data centre and
its resident rabbitmq cluster nodes could all disappear off the face of
the earth, the hard drives no longer usable, etc. Persistent messages
and HA do you no good in that kind of scenario, and from a disaster
recovery standpoint you definitely need a way to re-synchronise the
database and figure out which jobs have actually been lost.

> But, we do want the task to complete -- it's a revenue generator, for
> example.   So, one option is to use cron to look for stale "pending"
> request on the web side and assume the message was lost and re-queue.
> But, after X attempts maybe the cron job decides to give up.
>

It *sounds* to me like you need the housekeeping functions to set a
different completion status on the database record depending on how the
task was resolved.

> Now, this report generation actually uses a third-party web service, and
> this web service has gone down for extended periods for maintenance.
>   So, in this case the report request jobs stack up in the queue.
>
> So, if it's down long enough then cron might run again and re-queue the
> same job that is already in the queue.   What I have done for this is
> atomically change the state from "pending" to "in process" so that only
> one message gets processed.  But, using some kind of UUID and a store is
> another option, of course.
>

I'm a bit confused now. Where do you set this 'in process' status - on 
the newly submitted message, or in the database record or in some field 
in the originally submitted message(s)?

> Maybe you are right that durable queues are the correct solution for
> this -- I still need to track state on the web app side to show
> "pending" or "in process".   And maybe just use cron to report/clean up
> any stale pending job on the web app side.
>
> I'm just curious if the above is a common design pattern when using
> RabbitMQ in this way.  Obviously, depends on the specifics of the task,
> but we seem to have quite a few situations like this.
>

I still don't understand the difference between 'stale' and 'pending'. 
Whether you do this based on timestamp or uuid or whatever, you need 
*some* mechanism to avoid duplicating work. Because AMQP cannot reliably 
do 'only once' delivery without consumer intervention, I would expect 
that you need to track which jobs have been handled and which have not. 
What I don't understand is how this pending/stale flag helps you, nor 
why cron jobs are an attractive choice to deal with expiring messages.

It seems to me there are a few separate problem domains here, which are 
getting tangled up in our discussion. I would posit that you need to 
deal with

1. Making sure a job/task has definitely been 'registered' with the system.

2. Indicating the outcome of (1) to your users

3. Avoiding 're-submitting' the same job/task many times

4. Dealing with failures in external services

Please feel free to correct that list or add to it or whatever.

When it comes to (1), as I mentioned, durable queues with persistent 
messages are the way to go. Once a message is 'on disk' then it is 
reasonable for you to assume that the job is safely 'in the system' now. 
I won't pretend that this represents a complete disaster recovery 
solution, because we both know it does not. I do feel, however, that 
such a solution involves *far* more technology than just your message 
broker, so I'm going to gently push it out of scope for the purposes of 
this discussion. :)

As for (2), in the absence of queue browsing, you are probably doing the 
right thing already in terms of storing a record in your web 
application's database to indicate that the job has indeed been 
submitted (and is now in a pending state).

Your problem with 'duplicate tasks' appears to happen mainly because 
your cron job 're-submits' the message. With a persistent queue, there 
would be no need to do this at all, as the message is on disk and will 
survive a broker crash (though it won't survive if your data centre 
slips off a cliff into the ocean).

What I'd suggest is a slightly different approach. Set up your durable 
task queue with a 'dead letter exchange' so that expiring messages (or 
those rejected with `requeue=false`) will be shoved into that exchange. 
Now set up the target (dead letter) exchange to publish to another 
(durable) queue, let's call it 'redelivery', and make sure this is 
configured to stay around even when there are no consumers.
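
To make that concrete, here is a minimal sketch of the declarations,
assuming a Python client with a pika-style channel API. The names
'tasks' and 'tasks.dlx', the fanout exchange type, the 60 second TTL,
and the choice of a per-queue TTL (`x-message-ttl`) are all my
assumptions for illustration, not something your setup requires.

```python
TASK_QUEUE = "tasks"                # hypothetical queue name
DEAD_LETTER_EXCHANGE = "tasks.dlx"  # hypothetical exchange name
REDELIVERY_QUEUE = "redelivery"     # the name used in the text above


def declare_topology(channel, ttl_ms=60_000):
    """Declare the task queue, its dead letter exchange and 'redelivery'.

    `channel` is any object exposing pika-style exchange_declare,
    queue_declare and queue_bind methods. Returns the task queue arguments.
    """
    # Fanout, so every dead-lettered message reaches the bound queue
    # regardless of its routing key.
    channel.exchange_declare(
        exchange=DEAD_LETTER_EXCHANGE, exchange_type="fanout", durable=True
    )
    # Durable and not auto-delete: stays around even with no consumers.
    channel.queue_declare(
        queue=REDELIVERY_QUEUE, durable=True, auto_delete=False
    )
    channel.queue_bind(queue=REDELIVERY_QUEUE, exchange=DEAD_LETTER_EXCHANGE)
    # Expired or rejected (requeue=false) messages go to the exchange above.
    task_args = {
        "x-dead-letter-exchange": DEAD_LETTER_EXCHANGE,
        "x-message-ttl": ttl_ms,  # assumed per-queue TTL, in milliseconds
    }
    channel.queue_declare(queue=TASK_QUEUE, durable=True, arguments=task_args)
    return task_args
```

Note that re-declaring an existing queue with different arguments is
an error, so the TTL and dead letter settings need to be decided
before the queue is first created.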

Set up a 'permanent subscriber' to the 'redelivery' queue - i.e., have 
an always running thread consuming these messages and make sure it is 
restarted if it fails for any reason - and have this subscriber take 
each arriving message and re-submit it to the original task queue.

Finally, when submitting jobs to the task queue, set the TTL to a 
reasonable value (for your application's needs) and this is what will 
happen:

1. you submit the task
2. the task TTL expires after the correct time lapse
3. the broker sends the 'expired' message to the 'dead letter exchange'
4. the exchange routes the message to the 'redelivery' queue
5. the subscriber on the 'redelivery' queue re-submits the message (as a
new message!) to the task queue
6. a consumer (job) grabs the message before it expires this time
7. the job (process/thread/application) fails (due to an external 
service error or whatever)
8. the job (process/thread/application) rejects the message with `requeue=false`
9. steps 3, 4 and 5 run again
10. eventually something good happens!?

Actually, to deal properly with (10), you probably want to keep some
kind of timestamp with the message. The consumer that reads the
'redelivery' queue and re-submits jobs can then let a message time out
once it is too old, and set an error flag in the database (or something).
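
A rough sketch of such a subscriber, assuming a pika-style consumer
callback and assuming the original publisher sets the standard AMQP
`timestamp` property (seconds since the epoch) on each task. The queue
name, the six hour limit and the `mark_failed` hook are hypothetical.

```python
import time

TASK_QUEUE = "tasks"           # hypothetical queue name
MAX_AGE_SECONDS = 6 * 60 * 60  # hypothetical give-up threshold


def handle_redelivery(channel, method, properties, body,
                      mark_failed=lambda job: None, now=time.time):
    """Callback for the 'redelivery' queue: resubmit, or give up when stale.

    Returns "resubmitted" or "gave_up". A message with no timestamp is
    always resubmitted, since its age cannot be judged.
    """
    first_seen = getattr(properties, "timestamp", None)
    too_old = first_seen is not None and now() - first_seen > MAX_AGE_SECONDS
    if too_old:
        mark_failed(body)  # e.g. set an error flag on the database record
        outcome = "gave_up"
    else:
        # Publish through the default exchange straight to the task queue.
        # Reusing `properties` carries the original timestamp (and the
        # delivery mode) over to the new message.
        channel.basic_publish(exchange="", routing_key=TASK_QUEUE,
                              body=body, properties=properties)
        outcome = "resubmitted"
    # Ack only once the message has been handed off or written off.
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return outcome
```

Without publisher confirms there is still a small window in which the
re-published copy could be lost after the ack, so treat this as an
outline rather than a guarantee.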

If you *do* have some kind of external identity that holds for the 
(conceptual) lifetime of the task, then you could store this original 
timestamp in the database and query that against the task id, but 
obviously you'll need to consider the potential performance (and 
architectural) implications of doing that for yourself.

Step (8) might also be problematic if your tasks take a long time to
complete, so you may wish to rework that step in terms of re-submitting
instead of rejecting the message. As long as you have heartbeats
enabled, your consumer's connection shouldn't be closed while it works.
If it does drop before you've ack'ed or rejected the message, though,
the broker will requeue it and another consumer could 'get' it, so
you'll need to make your consumers idempotent to deal with this.

Whatever you choose to do, the database needs to be properly updated 
when a task does finally succeed. The fact that you *must* do this at 
some point already (in order for the UI to be consistent) means you 
already have a thread of identity, and therefore you should be able to 
use this to create idempotent consumers where duplicate tasks are 
potentially an issue.
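
One way to sketch an idempotent consumer along those lines, using
sqlite3 purely as a stand-in for your real database. The `jobs` table,
its `id` and `status` columns and the status strings are assumptions
based on your description, and `run_report` is a hypothetical callable.

```python
import sqlite3


def claim_job(conn, job_id):
    """Atomically move a job from 'pending' to 'in process'.

    Only the first caller for a given job gets True; duplicates get False.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'in process' "
            "WHERE id = ? AND status = 'pending'",
            (job_id,),
        )
    return cur.rowcount == 1


def handle_task(conn, job_id, run_report):
    """Run the report only if this delivery is the first to claim the job."""
    if not claim_job(conn, job_id):
        return "skipped"  # duplicate message; the caller should still ack it
    try:
        run_report(job_id)
    except Exception:
        # Release the claim so that a later redelivery can try again.
        with conn:
            conn.execute(
                "UPDATE jobs SET status = 'pending' WHERE id = ?", (job_id,)
            )
        raise
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    return "done"
```

The conditional UPDATE is what makes duplicates harmless: however many
copies of a message arrive, only one of them can flip the row out of
'pending'.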

> Oh, and with this example of the third-party web service another problem
> is knowing if a failure of this service is permanent or temporary.  I
> have not done this, but I'm tempted to have my workers pull the jobs off
> the queue and if the job fails for an unclear reason then ack the
> original job and then send it to a "try again later" queue and have
> separate workers handle those.
>

That sounds like a reasonable approach. It is somewhat similar to the 
approach I describe above.
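
For what it's worth, a worker along those lines might look roughly like
this, again assuming a pika-style callback. The 'try-again-later' queue
name, the `PermanentFailure` exception and the `run_job` and
`mark_failed` hooks are hypothetical. One deliberate difference from
your description: the sketch publishes to the retry queue *before*
ack'ing the original, so a crash between the two steps produces a
duplicate rather than a lost job.

```python
RETRY_QUEUE = "try-again-later"  # hypothetical queue name


class PermanentFailure(Exception):
    """Hypothetical: raised by the job when retrying cannot help."""


def handle_job(channel, method, properties, body, run_job, mark_failed):
    """Run a job; park unclear failures on the retry queue.

    Returns "done", "failed" or "retry_later".
    """
    try:
        run_job(body)
        outcome = "done"
    except PermanentFailure:
        mark_failed(body)  # e.g. record the error against the database row
        outcome = "failed"
    except Exception:
        # Unclear failure: hand the message to the retry workers first...
        channel.basic_publish(exchange="", routing_key=RETRY_QUEUE,
                              body=body, properties=properties)
        outcome = "retry_later"
    # ...and only then ack the original delivery.
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return outcome
```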