[rabbitmq-discuss] Design for global data ingestion

Thu Nov 7 17:47:40 GMT 2013

Hello All!

I wanted to shoot this out to see if anyone has had any experience with
using RabbitMQ for a mass global data ingestion pipeline.  A small
disclaimer: I'm a total RMQ noob :)

We currently have a fan-in design, where we have a single downstream 2-node
HA cluster in the same data center as our data warehouse.  We have around
22 upstreams (also 2-node HA clusters) located in datacenters all over the
world.  The configuration is extremely simple.  We have a single direct
exchange, which everything publishes to. Each application uses a specified
routing key for that application.  We end up with one queue per application
(currently around 10).  We are running 3.0.0 on the downstream cluster
(been waiting for a maintenance window to upgrade) and 3.1.5 on the
upstreams.

This design has held up well, and we are averaging around 20k messages/sec
over the course of a day.

We have run into two problems that prevent us from scaling any further.  The
first is the max bandwidth for a single TCP connection across the globe
(specifically between the US and China).  The second is we have maxed out
the CPU for the federation clients on the downstream (SSL is enabled, I'm
not sure how much CPU overhead that adds).

For the CPU issue, I figured the newly added federated queues would be a
perfect solution to the problem.  I can set up additional Rabbits on the
downstream side, set up the federation links, and have everything load
balance nicely.  The only thing it doesn't address is the max bandwidth for
a single TCP connection.  Because of our initial design, we would run into
max bandwidth problems for each queue.

Our current objective is to be able to send 20k messages/sec from each
datacenter for a single application.  In China, the most we can do is
around 2.5k messages/sec (about 1.6MB/sec, and that is on a good day).
Because this message load will be from a single application, with the
current design, it will be tied to a single routing key.  So for China, I
would need around 8 TCP connections to do this.
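
To make the arithmetic behind the "around 8 connections" estimate explicit,
here is a small sketch using only the figures quoted above.  The variable
names are illustrative, and it assumes 1.6MB/sec means 1.6 million bytes per
second and that per-connection throughput stays at the good-day rate.

```python
import math

# Figures taken from the post; everything else is derived from them.
target_rate = 20_000     # messages/sec wanted from one datacenter
per_link_rate = 2_500    # messages/sec observed on one US<->China connection
per_link_bytes = 1.6e6   # bytes/sec observed on that connection (assumed 1.6 * 10^6)

# Number of parallel connections needed if each one sustains the good-day rate.
links_needed = math.ceil(target_rate / per_link_rate)

# Implied average message size and total bandwidth at the target rate.
avg_message_bytes = per_link_bytes / per_link_rate
total_bytes_per_sec = target_rate * avg_message_bytes

print(links_needed, avg_message_bytes, total_bytes_per_sec)
```

On a bad day the per-connection rate is lower, so 8 should probably be read
as a lower bound rather than a comfortable target.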

For this use case, message order doesn't matter.  Does anyone have any
ideas on how I can set up multiple federation links that will be load
balanced?  Here are some ideas I have, but they all feel hacky.

1) On the upstreams, use a consistent hash exchange, with exchange to
exchange bindings to 8 direct exchanges, which would be federated.
2) Run multiple instances of RMQ on the downstream machines, and use
federated queues.  Total number of instances across all machines should be
greater than 8.
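
As a rough illustration of idea 1, the sketch below simulates spreading
messages across 8 shard exchanges by hashing a message id.  It is not the
consistent hash exchange plugin's actual algorithm and does not talk to a
broker; the exchange names (`ingest.shard.N`) and message ids are
hypothetical.  It only shows that, since ordering doesn't matter, a hash on
any well-distributed key should spread load fairly evenly over the links.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # roughly the number of connections estimated for China


def shard_exchange(message_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Map a message id to one of num_shards hypothetical exchange names."""
    digest = hashlib.md5(message_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and take it modulo
    # the shard count; the same id always lands on the same shard.
    index = int.from_bytes(digest[:8], "big") % num_shards
    return f"ingest.shard.{index}"


# Simulate 80,000 messages and count how many land on each shard.
counts = Counter(shard_exchange(f"msg-{i}") for i in range(80_000))
for name in sorted(counts):
    print(name, counts[name])
```

Each shard exchange would then carry roughly one eighth of the traffic over
its own federation link, assuming the hashed key is well distributed.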

My apologies in advance if I'm missing something obvious.  Please let me
know if I'm trying to fit a round peg in a square hole.  :)

Thanks!

-Dustin