[rabbitmq-discuss] Deduplication filters, and distributed key-value stores

Tony Garnock-Jones tonyg at lshift.net
Tue Nov 10 15:19:44 GMT 2009


We talk every now and then about deduplication filters for avoiding
repeated work. One problem is that they are hard to build for a worker
pool: the state recording which messages the pool has already seen has
to be shared across every worker in the pool.

Sounds like a job for a distributed key-value store. What if redis,
riak, or even memcached were used as the dedup filter? Each worker pool
(a.k.a. "queue", heh) would have a key-value store of its own. The
store would hold, for each labelled request, one of three things (see
the sketch after the list):

 - nothing: the request is fresh to the pool

 - an indication of partial completeness and/or receipt

 - the response that was sent back to the requestor, which can be
   sent again if a replay is detected, without doing any more work
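
To make the three cases concrete, here's a rough sketch in Python
against redis (via the redis-py client). The key layout, the
in-progress sentinel, and the function names are all invented for
illustration; none of this is existing RabbitMQ machinery:

  import redis

  IN_PROGRESS = b"__in_progress__"   # sentinel: receipt noted, work not done

  def dedup_key(queue, label):
      # Scope keys to the worker pool sharing this queue.
      return "dedup:%s:%s" % (queue, label)

  def claim(store, queue, label):
      # Returns one of:
      #   ("fresh", None)       - request unseen by the pool; we own it
      #   ("in_progress", None) - some worker has already taken receipt
      #   ("done", response)    - cached response: replay it, do no work
      key = dedup_key(queue, label)
      # SET ... NX succeeds only if the key is absent, so a fresh
      # request is claimed atomically even with many competing workers.
      if store.set(key, IN_PROGRESS, nx=True):
          return ("fresh", None)
      value = store.get(key)
      if value is None or value == IN_PROGRESS:
          return ("in_progress", None)
      return ("done", value)

  def record_response(store, queue, label, response):
      # Overwrite the sentinel with the real response so that replays
      # can be answered without redoing the work.
      store.set(dedup_key(queue, label), response)

A worker would call claim() when a request arrives, do the work only in
the "fresh" case, and finish with record_response().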

The nice thing is that not only are these distributed key-value stores
pretty quick these days, they also scale up tremendously well.
Furthermore, you can "shard" them naturally, because each store is
scoped to just the collection of workers sharing a single queue (see
the sketch below).
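
For instance, the queue name itself can serve as the shard key, since
dedup state never needs to cross pools. The hostnames and queue names
here are invented for illustration, continuing the sketch above:

  SHARDS = {
      "image-resize": redis.Redis(host="kv1.example.com"),
      "pdf-render":   redis.Redis(host="kv2.example.com"),
  }

  def store_for(queue):
      # Dedup state for one pool is never consulted by another pool,
      # so each pool can be pointed at its own store instance.
      return SHARDS[queue]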

The only missing piece is clearing out old, definitely stale records
from the filter after a while. For that you can use a separate
garbage-collection/expiry process, I guess; I haven't run any
experiments here, though. Maybe some of these new stores even include
time-to-live for their records, which solves the problem for us!
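
Redis, for one, really does have per-key time-to-live (EXPIRE, or the
ex argument to SET in redis-py), so the expiry story could be as simple
as this revision of record_response() from the sketch above (the
retention period is an arbitrary choice):

  STALE_AFTER = 24 * 60 * 60   # seconds; keep records for a day

  def record_response(store, queue, label, response):
      # The store drops the record itself once the TTL lapses, so no
      # separate garbage-collection/expiry process is needed.
      store.set(dedup_key(queue, label), response, ex=STALE_AFTER)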

So: rabbitmq for the routing, buffering, relaying and delivery of
requests and responses; and a distributed key-value store for deduplication.

Tony



