[rabbitmq-discuss] Application architecture question: queue failure

Mon Jun 18 16:37:42 BST 2012

On Wed, Jun 13, 2012 at 3:53 AM, Tim Watson <tim at rabbitmq.com> wrote:

>
>> Anyway, sorry if this is a mundane (if not a bit off-topic) question --
>>
>> and I know it's application-specific. But, it's a question that comes up
>> often in our design discussions.
>>
>> Do you have these concerns and how do you handle the possibility of
>> message or queue loss?
>>
>>
> Well rabbit fights like a cornered... rabbit ... to make sure this
> doesn't happen! ;)
>
> Please feel free to elaborate on your concerns and questions, as that's
> what the list is for! I'd certainly like to understand a bit more about how
> your application works, what constitutes a job and how these are identified
> throughout the system. I often find the issue of identity can be
> particularly vexing in any non-trivial architecture.
>

Let me give you an example -- which is an actual workflow we have.

In our web app a user can request a report. We want the user to see that
the report is indeed queued, so in the database we set a flag recording
that the job was sent, and when. This lets us display "pending" so the
user doesn't submit the request multiple times.

The web app queues the message for the background report generation.
Anything is possible -- so imagine first that the message is somehow lost.
The web app is still showing "pending" to the user.
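
A rough sketch of that flag-then-queue step, in Python, with sqlite3
standing in for our real database. The table name `report_jobs`, its
columns, and the `publish` callable are all made up for illustration;
`publish` is whatever actually hands the message to RabbitMQ.

```python
import sqlite3
import uuid

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS report_jobs (
    id        TEXT PRIMARY KEY,
    user_id   INTEGER NOT NULL,
    state     TEXT NOT NULL,
    queued_at REAL NOT NULL
)
"""


def request_report(conn, user_id, now, publish):
    """Record a pending job, then publish it.

    Returns the new job id, or None if this user already has a pending
    request (so the same report is not submitted twice).
    """
    already = conn.execute(
        "SELECT 1 FROM report_jobs WHERE user_id = ? AND state = 'pending'",
        (user_id,),
    ).fetchone()
    if already:
        return None

    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO report_jobs (id, user_id, state, queued_at) "
        "VALUES (?, ?, 'pending', ?)",
        (job_id, user_id, now),
    )
    conn.commit()

    # If the message is lost after this point, the row stays 'pending'
    # and the user keeps seeing "pending" in the web app.
    publish({"job_id": job_id, "user_id": user_id})
    return job_id
```

The row is committed before publishing, so a lost message leaves a visible
"pending" record rather than a request nobody knows about.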

But, we do want the task to complete -- it's a revenue generator, for
example. So, one option is to use cron to look for stale "pending"
requests on the web side, assume the message was lost, and re-queue it.
But, after X attempts maybe the cron job decides to give up.
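
Roughly what I have in mind for the cron sweep, again in Python with
sqlite3 as a stand-in. The `report_jobs` table, the `attempts` column, the
"failed" state, and both constants are assumptions for the sketch, not
what we actually run.

```python
import sqlite3

MAX_ATTEMPTS = 3         # hypothetical value for "X attempts"
STALE_AFTER_SECS = 3600  # hypothetical age at which "pending" is stale

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS report_jobs (
    id        TEXT PRIMARY KEY,
    state     TEXT NOT NULL,
    queued_at REAL NOT NULL,
    attempts  INTEGER NOT NULL DEFAULT 0
)
"""


def sweep_stale(conn, now, publish):
    """Re-queue stale pending jobs; give up on those out of attempts.

    Returns (requeued_ids, given_up_ids).
    """
    cutoff = now - STALE_AFTER_SECS
    rows = conn.execute(
        "SELECT id, attempts FROM report_jobs "
        "WHERE state = 'pending' AND queued_at <= ? ORDER BY id",
        (cutoff,),
    ).fetchall()

    requeued, given_up = [], []
    for job_id, attempts in rows:
        if attempts >= MAX_ATTEMPTS:
            # Out of retries: flag it so someone can look at it.
            conn.execute(
                "UPDATE report_jobs SET state = 'failed' WHERE id = ?",
                (job_id,),
            )
            given_up.append(job_id)
        else:
            # Assume the message was lost: bump the counter, reset the
            # timestamp, and publish again.
            conn.execute(
                "UPDATE report_jobs SET attempts = attempts + 1, "
                "queued_at = ? WHERE id = ?",
                (now, job_id),
            )
            publish(job_id)
            requeued.append(job_id)
    conn.commit()
    return requeued, given_up
```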

Now, this report generation actually uses a third-party web service, and
this web service has gone down for extended periods for maintenance.  So,
in this case the report request jobs stack up in the queue.

So, if the service is down long enough, cron might run again and re-queue
a job that is already in the queue. What I have done for this is
atomically change the state from "pending" to "in process" so that only one
message gets processed. But, using some kind of UUID and a store is
another option, of course.
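
The atomic state change can be a single conditional UPDATE. A minimal
sketch in Python with sqlite3 (the `report_jobs` table and `claim_job`
name are hypothetical): whichever worker's UPDATE matches the row wins,
and any duplicate message for the same job sees zero rows changed and can
simply be acked and dropped.

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS report_jobs (
    id    TEXT PRIMARY KEY,
    state TEXT NOT NULL
)
"""


def claim_job(conn, job_id):
    """Move a job from 'pending' to 'in process'.

    Returns True if this caller made the transition, False if the job
    was already claimed (or does not exist).
    """
    cur = conn.execute(
        "UPDATE report_jobs SET state = 'in process' "
        "WHERE id = ? AND state = 'pending'",
        (job_id,),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually modified.
    return cur.rowcount == 1
```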

Maybe you are right that durable queues are the correct solution for this.
Even so, I still need to track state on the web app side to show "pending"
or "in process". And maybe just use cron to report or clean up any stale
pending jobs on the web app side.

I'm just curious if the above is a common design pattern when using
RabbitMQ in this way. Obviously it depends on the specifics of the task, but
we seem to have quite a few situations like this.

Oh, and with this example of the third-party web service, another problem
is knowing whether a failure of the service is permanent or temporary. I
have not done this, but I'm tempted to have my workers pull the jobs off
the queue and, if a job fails for an unclear reason, ack the original
message, send it to a "try again later" queue, and have separate workers
handle those.
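
An untested-in-production sketch of that worker logic in Python. The
queues here are plain in-memory deques, not RabbitMQ; in a real worker,
returning from `process` would correspond to acking the original message,
and the append to `retry_queue` would be a publish to the "try again
later" queue. `PermanentFailure`, `process`, and `max_retries` are names
I made up for the example.

```python
from collections import deque


class PermanentFailure(Exception):
    """Hypothetical: the service clearly rejected the request."""


def process(message, call_service, retry_queue, failed, max_retries=5):
    """Handle one message and report what happened to it.

    Returning normally stands in for acking the original message.
    """
    try:
        call_service(message["job_id"])
        return "done"
    except PermanentFailure:
        # No point retrying; park it for a human to look at.
        failed.append(message)
        return "failed"
    except Exception:
        # Unclear failure: hand a copy to the retry queue with a
        # counter, unless it has already been retried too often.
        retries = message.get("retries", 0) + 1
        if retries > max_retries:
            failed.append(message)
            return "failed"
        retry_queue.append({**message, "retries": retries})
        return "retry"
```

Carrying the retry count in the message itself means the separate retry
workers don't need any extra store to know when to stop.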

Thanks,

-- 
Bill Moseley
moseley at hank.org