[rabbitmq-discuss] Exactly Once Delivery

Matthew Sackman matthew at rabbitmq.com
Thu Aug 5 12:22:52 BST 2010


Hi Mike,

On Tue, Aug 03, 2010 at 04:43:56AM -0400, Mike Petrusis wrote:
> In reviewing the mailing list archives, I see various threads which state that ensuring "exactly once" delivery requires deduplication by the consumer.  For example the following:
> 
> "Exactly-once requires coordination between consumers, or idempotency,
> even when there is just a single queue. The consumer, broker or network
> may die during the transmission of the ack for a message, thus causing
> retransmission of the message (which the consumer has already seen and
> processed) at a later point."  http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-July/004237.html
> 
> In the case of competing consumers which pull messages from the same queue, this will require some sort of shared state between consumers to de-duplicate messages (assuming the consumers are not idempotent).   
> 
> Our application is using RabbitMQ to distribute tasks across multiple workers residing on different servers, this adds to the cost of sharing state between the workers. 
> 
> Another message in the email archive mentions that "You can guarantee exactly-once delivery if you use transactions, durable queues and exchanges, and persistent messages, but only as long as any failing node eventually recovers."

All of the above is sort of wrong. You can never *guarantee* exactly
once. (There's a perennial argument about whether receiving duplicate
messages but relying on idempotency to mask them counts as exactly
once. I don't feel it does, and why should become clearer further on...)

The problem is publishers. If the server on which RabbitMQ is running
crashes after committing a transaction containing publishes, it's
possible the commit-ok message gets lost. The publishers then still
think they need to republish, so they wait until the broker comes back
up and republish. This can happen an unbounded number of times: the
publishers connect, start a transaction, publish messages, commit the
transaction, the commit-ok gets lost, and the publishers repeat the
whole process.
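
To make this concrete, here's a minimal sketch of such a publisher
using the Python pika client (the client library, queue name and
message id scheme are my choices, not anything RabbitMQ prescribes).
If the commit-ok is lost, the publisher cannot distinguish a committed
transaction from a failed one, so the only safe thing it can do is
republish:

    import pika

    def publish_until_committed(body, msg_id):
        # Loop until we actually see a commit-ok. If the broker dies
        # after committing but before the commit-ok reaches us, we
        # republish a message the broker has already stored: a
        # duplicate.
        while True:
            try:
                conn = pika.BlockingConnection(
                    pika.ConnectionParameters('localhost'))
                ch = conn.channel()
                ch.tx_select()                    # start a transaction
                ch.basic_publish(
                    exchange='',
                    routing_key='tasks',
                    body=body,
                    properties=pika.BasicProperties(
                        message_id=msg_id,        # stable id for dedup
                        delivery_mode=2))         # persistent
                ch.tx_commit()                    # commit-ok may be lost
                conn.close()
                return
            except pika.exceptions.AMQPError:
                continue                          # assume failure; retry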

As a result, on the clients, you need to detect duplicates, and this
is the real barrier to making all operations idempotent. The problem
is that you never know how many copies of a message there will be, so
you never know when it's safe to remove entries from your dedup cache.
Something like Redis can expire entries after a set amount of time,
which would at least allow you to avoid the database eating up all the
RAM in the universe, but there's still the possibility that after an
entry has expired, another duplicate will come along which you now
won't detect as a duplicate.
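
As a sketch (the key prefix and TTL are illustrative, nothing
standard), Redis's SET with the NX and EX options gives you an atomic
"insert if absent" that doubles as the duplicate check:

    import redis

    r = redis.Redis(host='localhost')

    def first_time_seen(msg_id, ttl_secs=86400):
        # SET ... NX EX: succeeds only if the key is absent, and the
        # entry expires after ttl_secs. Truthy for the first copy,
        # falsy for any duplicate arriving within the TTL window. A
        # duplicate arriving *after* expiry will wrongly look new -
        # exactly the unavoidable gap described above.
        return r.set('dedup:' + msg_id, 1, nx=True, ex=ttl_secs)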

This isn't just a problem with RabbitMQ - in any messaging system, if
any message can be lost, you cannot achieve exactly-once semantics.
The best you can hope for is a probability, with a large number of 9s,
that you will detect all the duplicates. But that's the best you can
achieve.

Scaling horizontally is thus trickier because, as you say, you may now
have multiple consumers, each of which receives one copy of a message.
Thus the dedup database would have to be distributed. With high message
rates, this might well become prohibitive because of the amount of
network traffic the transactions between the consumers generate.

> What's the recommended way to deal with the potential of duplicate messages?  

Currently, there is no "recommended" way. If you have a single
consumer, it's quite easy - something like Tokyo Cabinet should be
more than sufficiently performant. For multiple consumers, you're
currently going to have to look at some sort of distributed database.
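
For the single-consumer case, a local key-value store really is all it
takes. A minimal sketch, using Python's built-in dbm module as a
stand-in for Tokyo Cabinet (the store choice, file name and process()
callback are mine):

    import dbm

    db = dbm.open('seen-messages.db', 'c')  # persistent local store

    def handle(msg_id, body, process):
        # A single consumer handles messages serially, so a plain
        # check-then-insert has no races to worry about.
        key = msg_id.encode()
        if key in db:
            return          # duplicate: already processed, drop it
        process(body)
        db[key] = b'1'      # crash between these two steps and the
                            # message is reprocessed on recovery -
                            # at-least-once again, in miniature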

> Is this a rare enough edge case that most people just ignore it?

No idea. But one way of making your life easier is for the producer to
send slightly different messages on every republish (they would still
obviously need to have the same msg id). That way, if you receive a
msg with "republish count" == 0, you know it's the first copy, so you
can insert asynchronously into your shared database and then act on
the message. You only need to query the database when you receive a
msg with "republish count" > 0. Thus you can tune your database for
inserts and hopefully save some work: the common case will be the
first copy, and lookups will be exceedingly rare.
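
A sketch of that consumer logic, assuming the producer stamps a
"republish-count" header (the header name and the helpers
record_seen_async(), already_seen(), process() and
process_and_record() are inventions for illustration):

    def on_message(ch, method, properties, body):
        # The producer increments this header on every republish.
        count = (properties.headers or {}).get('republish-count', 0)
        msg_id = properties.message_id
        if count == 0:
            record_seen_async(msg_id)      # fire-and-forget insert
            process(body)
        elif already_seen(msg_id):
            pass                           # duplicate: drop it
        else:
            # Republished copy with no database entry - the hairy
            # case discussed below.
            process_and_record(msg_id, body)
        ch.basic_ack(delivery_tag=method.delivery_tag)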

The question then is: if you've received a msg with republish count >
0 but there's no entry in the database, what do you do? It shouldn't
have overtaken the first publish (though if consumers disconnected
without acking, or requeued messages, it could have), so you need to
run some sort of synchronisation operation between all the consumers
to ensure none of them is in the middle of adding to the database - it
all gets a bit hairy at this point.

Thus if your message rate is low, you're much safer doing the insert
and select on every message. If that's too expensive, you're going to
have to think very hard indeed about how to avoid races in which
several consumers each decide they're responsible for acting on the
same message.
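
If you do pay for a database operation per message, it's worth making
it a single atomic "claim" rather than a separate select followed by
an insert, so the race is settled inside the database. A sketch (table
name and schema are mine, and SQLite here merely stands in for
whatever shared database the consumers actually use - with e.g.
PostgreSQL the equivalent is INSERT ... ON CONFLICT DO NOTHING plus a
check of the affected row count):

    import sqlite3

    conn = sqlite3.connect('dedup.db')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS seen (msg_id TEXT PRIMARY KEY)')

    def claim(msg_id):
        # INSERT OR IGNORE is atomic: exactly one insert per msg_id
        # takes effect. rowcount == 1 means we won the claim and
        # should process the message; rowcount == 0 means another
        # consumer (or an earlier copy) got there first.
        with conn:
            cur = conn.execute(
                'INSERT OR IGNORE INTO seen (msg_id) VALUES (?)',
                (msg_id,))
        return cur.rowcount == 1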

This stuff isn't easy.

Matthew

