[rabbitmq-discuss] Consumer crash, redelivery and prefetch

Mon Mar 17 18:47:01 GMT 2014

Thanks for the replies, all. It seems we are not alone with this
feature request.

On Fri, 2014-03-14 at 08:28 -0400, Laing, Michael wrote:
> It's a good topic. 
> 
> 
> In our std framework, based on python pika, a service may fail in
> processing a message due to an exception being raised - something
> unanticipated - the service will have chosen a default action to take
> in that case when it was initialized, typically 'reject'. Typically it
> will log a warning as well.
> 
> 
> We gather rejected messages in a 'reject' exchange and process them
> enough (via their headers) to route them back to their originators as
> well as to our own 'triage' queue.
> 

This dead-lettering back to the originator through a headers dead-letter
exchange is a really useful pattern for RPC over RabbitMQ. I was
surprised not to find any mention of it in what I have read or on the
mailing list.

(Alas, it requires either duplicating the "replyTo" information, once in
the standard AMQP message property and a second time in a non-standard
header, since no RabbitMQ exchange routes on the "replyTo" property, or
not using the standard "replyTo" at all.)
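
To make the duplication concrete, here is a minimal sketch of how I
picture the topology. It is my own reading of the pattern, not Michael's
framework. The exchange name "reject" comes from his description; the
header name "reply-to-copy", the helper names and the queue names are
made up for illustration. The channel is assumed to be a pika channel
(recent pika releases spell the keyword exchange_type). I avoided an
"x-" prefix for the header because, as far as I know, headers exchange
bindings do not match on keys starting with "x-".

```python
REJECT_EXCHANGE = "reject"      # name taken from the quoted description
REPLY_HEADER = "reply-to-copy"  # hypothetical name for the duplicated replyTo


def work_queue_arguments():
    """Queue arguments so rejected messages are dead-lettered to the exchange."""
    return {"x-dead-letter-exchange": REJECT_EXCHANGE}


def reply_binding_arguments(reply_queue):
    """Headers binding that matches messages carrying this originator's copy."""
    return {"x-match": "all", REPLY_HEADER: reply_queue}


def request_property_fields(reply_queue):
    """Fields for the request properties: standard replyTo plus the header copy."""
    return {"reply_to": reply_queue, "headers": {REPLY_HEADER: reply_queue}}


def declare_topology(channel, work_queue, reply_queue):
    """Declare the exchange, both queues and the binding on a pika-style channel."""
    channel.exchange_declare(exchange=REJECT_EXCHANGE, exchange_type="headers")
    channel.queue_declare(queue=work_queue, arguments=work_queue_arguments())
    channel.queue_declare(queue=reply_queue)
    channel.queue_bind(
        queue=reply_queue,
        exchange=REJECT_EXCHANGE,
        arguments=reply_binding_arguments(reply_queue),
    )
```

The requester would then publish with something like
pika.BasicProperties(**request_property_fields("rpc.replies.client-1")),
so that a rejected request comes back on its own reply queue.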

> 
> Our messages all carry their processing history in their headers:
> region, zone, instance, pid, service, timestamp, etc. - again part of
> the framework.
> 
> 
> We also gather and coordinate the logs of all services on all
> instances.
> 
> 
> Additionally we replicate messages and process them in parallel
> through our Core clusters in multiple regions.
> 
> 
> A truly poison message will fail spectacularly everywhere. We have not
> actually encountered one yet in production. We do get them in staging,
> and bells go off everywhere.
> 
> 
> A failure of infrastructure will be localized to a region, zone,
> instance, or supporting service like Cassandra or the AWS control
> plane. Anticipated failures are retried. Unanticipated failures result
> in rejection of that message replica but other replicas should
> succeed. We do get these in production and can immediately tell where
> failures occurred and take appropriate action, e.g. shifting load away
> from failure if it has not yet taken place automatically.
> 
> 
> Of course it would be nice to get more info upon rejection. We
> compensate by creating context around rejection and coordinating the
> context in near real time across the nyt⨍aбrik.
> 

I think we will go with manually re-queueing redelivered messages (into
the same queue) with a custom "redelivery-count" header that we
increment ourselves, instead of just rejecting them as we do now.
Then, upon receiving a non-redelivered message, we decide whether to
reject it based on that custom "redelivery-count" header.

It's a variation of the re-queueing to a "probably poison" second queue
technique mentioned earlier in this thread. Indeed I don't see why we
need a second queue when we can just modify a header and re-queue.

The only drawback is that we modify the message, which the broker
avoids as much as possible, but since we do this in the consumer rather
than on the broker, we can do it without any issue.
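
A rough sketch of what I have in mind for the consumer callback, to make
the two branches explicit. It assumes a pika-style channel, method and
properties; the threshold MAX_REDELIVERIES, the process and queue_name
parameters and the returned strings are assumptions for illustration
only. Note that the republish and the ack are not atomic, so a crash
between the two could leave a duplicate in the queue.

```python
MAX_REDELIVERIES = 3                # assumed threshold, pick your own
COUNT_HEADER = "redelivery-count"   # the custom header described above


def on_message(channel, method, properties, body, process, queue_name):
    """Handle one delivery; returns what was done, for logging or tests."""
    headers = dict(properties.headers or {})
    count = int(headers.get(COUNT_HEADER, 0))

    if method.redelivered:
        # A previous consumer took this message and never acked it.
        # Republish a copy to the same queue with the counter incremented,
        # then ack the original so the broker drops it.
        headers[COUNT_HEADER] = count + 1
        properties.headers = headers
        channel.basic_publish(
            exchange="",              # default exchange routes by queue name
            routing_key=queue_name,
            body=body,
            properties=properties,
        )
        channel.basic_ack(delivery_tag=method.delivery_tag)
        return "requeued"

    if count >= MAX_REDELIVERIES:
        # Fresh delivery of a copy that has already failed too many times:
        # reject without requeue so it is dropped or dead-lettered.
        channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return "rejected"

    process(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return "processed"
```

With pika this could be wired in through functools.partial, passing the
process callable and the queue name, and handing the result to
basic_consume as the callback.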

Cheers,
Thomas