[rabbitmq-discuss] Our fight against scammers

Fri Aug 8 19:52:15 BST 2008

First off, I'd like to thank you for this wonderful tool, it's being
put to good use!

I thought you might be interested in knowing how we're using it at my
company as part of a system to identify and rid ourselves of scammers.

It used to be that scammers were easy to spot. An account would be
created and would immediately mass-message a huge number of others.
Suspicious. Those accounts were flagged and banned.

Also, before I even started working here they'd taken a sort of
scorched-earth approach to scam prevention. Scammers usually came from
IPs in places like Nigeria, South Africa, etc., so thanks to the
wonders of GeoIP location entire countries were banned. So the
scammers got smarter. Instead they created multiple accounts that
waited a bit and sent a small number of messages. Harder to spot but
not impossible. Multiple accounts created from the same IP aren't
always scammers (could simply be a NAT) but they often are, so they
were flagged for review. They also used proxies to create the
accounts, which finally put them under the radar entirely.
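
The same-IP heuristic above could be sketched roughly like this. This
is only an illustration, not our actual code: the function name, the
(account, ip) input shape and the threshold of 3 are all assumptions.

```python
from collections import defaultdict


def flag_shared_ips(signups, threshold=3):
    """Group signups by IP and return the IPs with many accounts.

    `signups` is an iterable of (account_id, ip) pairs (a hypothetical
    shape). The result maps each suspicious IP to its account ids, for
    human review rather than an automatic ban, since a shared IP may
    simply be a NAT.
    """
    by_ip = defaultdict(list)
    for account_id, ip in signups:
        by_ip[ip].append(account_id)
    return {ip: ids for ip, ids in by_ip.items() if len(ids) >= threshold}


# Example data using documentation-only IP ranges.
signups = [
    ("alice", "203.0.113.7"),
    ("bob", "198.51.100.2"),
    ("carol", "203.0.113.7"),
    ("dave", "203.0.113.7"),
]
flagged = flag_shared_ips(signups)
```

As noted, this stops working once the accounts are created through
proxies, because each signup then appears to come from a different IP.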

Here's where RabbitMQ comes in. I've just finished a new filtering
system where, each time a message is sent, the web server queues up
the sender's message history into RabbitMQ. At the other end I have a
consumer that takes the message, feeds it into crm114 and gets a
result. If it looks scammy, the consumer writes a log entry in the
database for a moderator to review later. If it turns out to be a
false positive, the message history is put back into the queue, and
the consumer picks it up and updates the "good" file. Whenever a human
report or another algorithmic match (I also check message similarity
for possible template spam) is confirmed by a moderator, the message
history is put into the queue and the consumer takes it out and
updates the "bad" file.
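
For the message-similarity check mentioned in passing, one possible
approach is a pairwise comparison with Python's difflib. This is a
sketch under assumptions: the post doesn't say how similarity is
actually measured, and the 0.8 threshold and function names are made
up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_templated(messages, threshold=0.8):
    """True if any two messages in a history are near-duplicates.

    Pairwise comparison is quadratic in the number of messages, which
    is fine for a single sender's short history but not for comparing
    across all members.
    """
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if similarity(messages[i], messages[j]) >= threshold:
                return True
    return False


templated = looks_templated([
    "Hello Anna, I am a prince and need your help moving funds.",
    "Hello Maria, I am a prince and need your help moving funds.",
])
ordinary = looks_templated([
    "See you at lunch?",
    "The report is attached.",
])
```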

I tried out both the Python py-amqplib and RabbitMQ Java clients but
I'm currently using the Java client because it's a whole lot faster.
Normally I would have much preferred to write my producer and consumer
in Python but the speed increase was worth writing a little Java code.
With the Java client I'm getting incredibly good throughput, so much
so that I imagine we'll be able to stay with a single consumer for a
while. This system isn't live yet, so the only throughput numbers I
have are from running everything (the server, producer and consumer)
on my average-horsepower laptop, but even if it were deployed to my
laptop it would probably be enough for now. (It won't be, though:
we're getting it a beefy server, of course, and I'll have a better
idea of the real numbers once the machine arrives.) On my laptop I can
send ~8800 messages/sec and I can consume & process ~300 messages/sec.

The way I'm running everything right now is basically like this:

The producer is installed on the web servers and listens on a local
socket. It blindly forwards everything it gets to a RabbitMQ queue.
Basically I did it this way just because there aren't any PHP
libraries available.
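
A rough sketch of that forwarding loop, for illustration only. The
real producer is written in Java and publishes to RabbitMQ; here an
in-process queue.Queue stands in for the broker, and the
newline-delimited framing is an assumption (the post doesn't describe
the wire format, and a real history payload containing newlines would
need different framing).

```python
import queue
import socket


def forward_lines(conn, publish):
    """Read newline-delimited payloads from `conn` until EOF.

    Each non-empty payload is handed to `publish` unmodified; in the
    real system that callable would publish to a RabbitMQ queue.
    Returns the number of payloads forwarded.
    """
    count = 0
    with conn.makefile("rb") as stream:
        for line in stream:
            payload = line.rstrip(b"\n")
            if payload:
                publish(payload)
                count += 1
    return count


# Demo: a socket pair plays the web server and the local producer.
outbox = queue.Queue()  # stand-in for the broker
web_side, producer_side = socket.socketpair()
web_side.sendall(b"<history id='1'/>\n<history id='2'/>\n")
web_side.close()  # closing signals EOF to the reader
forwarded = forward_lines(producer_side, outbox.put)
producer_side.close()
```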

When the PHP messaging module is hit, it delivers the message as
usual, but it also forwards the message history to the producer on a
local socket. (This step is most probably going to be changed to a
cron job before deployment, but for now on the dev platform it's
running like this.)

The consumer has its own box with crm114 installed. (The RabbitMQ
server will probably be installed on this box as well.) It waits on
the queue and processes the histories based on crm114 output.

Of course, using RabbitMQ for such a simple scenario is probably
incredible overkill, since guaranteed delivery isn't all that
important here and I could easily have used something like beanstalkd.
(It wasn't overkill from an effort perspective, though: it was
actually reasonably easy to learn, set up and get going.) But this was
really just a proof of concept for a later (probably sooner rather
than later) project of writing our own financial transaction
processing, which of course needs to be much more rigorous.

Also, this many-producer (we have a bunch of web servers) one-listener
scenario is fine for now but later on we're most probably going to
need more than one box for message analysis. This is where I run into
a distribution problem and where RabbitMQ will shine. For now what I
feed into the queue is an xml message with an action to be performed
(pick, learnbad, learngood), an optional member id and a message
history. The reason this won't scale to many consumers is that crm114
keeps its "good" and "bad" databases as statistics files on disk.
Later, when I need more consumer boxes, I can easily refactor this to
one box that listens on a learn queue and, whenever it updates a file
(these updates are rare since it's human moderators that generate
them), drops the modified file into the queue that all pick consumers
are listening on, so they can all replace the right statistics file
with the new one.
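
To make the message format concrete, here is a minimal sketch of
parsing and dispatching such a message. Only the three action names
come from the description above; the element names (job, action,
member, history, message) are hypothetical, and the crm114 calls are
replaced by plain callables rather than a real invocation.

```python
import xml.etree.ElementTree as ET

ACTIONS = ("pick", "learnbad", "learngood")


def parse_job(xml_text):
    """Return (action, member_id, messages) from one queue message."""
    root = ET.fromstring(xml_text)
    action = root.findtext("action")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    member_id = root.findtext("member")  # None when omitted
    messages = [m.text or "" for m in root.findall("history/message")]
    return action, member_id, messages


def handle_job(xml_text, classify, learn):
    """Dispatch one message.

    `classify(text)` and `learn(label, text)` are stand-ins for the
    crm114 classification and learning steps.
    """
    action, _member_id, messages = parse_job(xml_text)
    text = "\n".join(messages)
    if action == "pick":
        return classify(text)
    learn("bad" if action == "learnbad" else "good", text)
    return None


sample = """<job>
  <action>learnbad</action>
  <member>42</member>
  <history>
    <message>Dear friend, I need your help.</message>
    <message>Please send your bank details.</message>
  </history>
</job>"""

learned = []
handle_job(
    sample,
    classify=lambda text: "good",
    learn=lambda label, text: learned.append((label, text)),
)
```

One caveat on the multi-consumer plan: with several pick consumers
reading from a single shared queue, each message goes to only one of
them. For every consumer to receive the new statistics file, each
would need its own queue bound to a fanout exchange.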

So all this to say that I am pleased to report that RabbitMQ is what
I'm going to highly recommend we build the financial processing
project around. Thanks a lot for the amazing work.