[rabbitmq-discuss] Design for global data ingestion

Dustin dustink.ml at gmail.com
Wed Nov 13 19:40:13 GMT 2013

Thanks a ton everyone for the feedback and suggestions.  Unfortunately our
data warehouse is in the US, so I wouldn't be able to use one of our other
datacenters as the final destination.

This really helps clear a lot of things up, Simon. Now that I know a queue
can only use one CPU, I was able to quadruple my throughput in my load
testing by using the random exchange bound to 4 queues.  Unfortunately I
can't use the consistent hash exchange, as our routing keys aren't very
random, and our developers don't currently have time to write a random
value into a header that I could hash on.  I would have preferred the
consistent hash exchange, as it's included in the base install.
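The skew from non-random routing keys is easy to demonstrate offline. A minimal sketch (the bucket count and key names are made up, and MD5 stands in for whatever hash function the exchange actually uses):

```python
import hashlib
from collections import Counter

def bucket(key: str, n_buckets: int) -> int:
    """Map a routing key to one of n_buckets via a stable hash (MD5 here)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# With only a couple of distinct routing keys (one per application),
# at most that many of the 4 queues can ever receive traffic.
app_keys = ["app.metrics", "app.logs"]  # hypothetical per-app keys
used = {bucket(k, 4) for k in app_keys}
print(f"distinct buckets hit by 2 fixed keys: {len(used)}")  # at most 2

# With a random value per message, traffic spreads across all 4 queues.
random_keys = [f"msg-{i}" for i in range(10_000)]
counts = Counter(bucket(k, 4) for k in random_keys)
print(dict(counts))  # roughly 2500 per bucket
```

That is why a per-message random header (or the random exchange) balances where a hash of a handful of fixed keys cannot.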

I didn't know I could define multiple upstreams with the same URI.  This
should solve the problem at hand.
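For anyone else hitting the single-connection cap: each upstream definition gets its own federation link (its own TCP connection), so several upstreams pointing at the same URI give you parallel connections. A minimal sketch of the parameter documents (the upstream names, URI, and set name are all made up; each JSON value would be fed to `rabbitmqctl set_parameter`):

```python
import json

UPSTREAM_URI = "amqps://feeder.example.com"  # hypothetical upstream broker

# Several upstream definitions with the same URI: the federation plugin
# opens one link per upstream, so N upstreams means N parallel
# TCP connections to the same broker.
upstreams = {f"china-link-{i}": {"uri": UPSTREAM_URI} for i in range(1, 5)}

for name, definition in upstreams.items():
    print(f"rabbitmqctl set_parameter federation-upstream {name} "
          f"'{json.dumps(definition)}'")

# An upstream-set groups the links so one policy can federate over all of them.
upstream_set = [{"upstream": name} for name in upstreams]
print("rabbitmqctl set_parameter federation-upstream-set china-links "
      f"'{json.dumps(upstream_set)}'")
```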

Thanks again!


On Wed, Nov 13, 2013 at 6:12 AM, Laing, Michael
<michael.laing at nytimes.com> wrote:

> We found the AWS Tokyo region to be the best transit point to/from China,
> although that was a while ago.
> Tokyo has 3 zones which now is the primary reason we base clusters there.
> Singapore should be good too.
> ml
> On Wed, Nov 13, 2013 at 8:33 AM, Alvaro Videla <videlalvaro at gmail.com> wrote:
>> Hi,
>> I've been trying to set up something like this, more of an
>> experiment/research project to see how far federation can be taken across
>> AWS availability zones, but I have to admit that I've been sidetracked by
>> various projects and conferences.
>> When I was living in China, I remember a company there had a similar
>> problem to yours when sending data to the US.
>> Considering that you have upstreams at various locations in the world,
>> could it be the case that you have better TCP bandwidth from China to
>> another non US location, say HK, Japan, Singapore, or Korea?
>> If that's the case you could perhaps try to form a different federation
>> graph that would allow you to achieve some max flow routing. Say publishing
>> from China to a country that has better bandwidth than from China to the
>> US, and then from that country to the US, or to a third country and so on.
>> I know this might sound nice in theory, but I haven't tried it. I'm
>> throwing it out nonetheless, in case it leads you to a solution.
>> Regards,
>> Alvaro
>> On Thu, Nov 7, 2013 at 6:47 PM, Dustin <dustink.ml at gmail.com> wrote:
>>> Hello All!
>>> I wanted to shoot this out to see if anyone has had any experience with
>>> using RabbitMQ for a mass global data ingestion pipeline.  A small
>>> disclaimer, I'm a total RMQ noob :)
>>> We currently have a fan-in design, where we have a single downstream 2
>>> node HA cluster in the same data center as our data warehouse.  We have
>>> around 22 upstreams (also 2 node HA clusters) located in datacenters all
>>> over the world.  The configuration is extremely simple.  We have a single
>>> direct exchange, which everything publishes to. Each application uses a
>>> specified routing key for that application.  We end up with a queue per
>>> application (currently around 10).  We are running 3.0.0 on the downstream
>>> cluster (been waiting for a maintenance window to upgrade) and 3.1.5 on the
>>> upstreams.
>>> This design has held up well, and we are averaging around 20k
>>> messages/sec over the course of a day.
>>> We have run into 2 problems which won't allow us to scale any further.
>>>  The first is the max bandwidth for a single TCP connection across the
>>> globe (specifically between the US and China).  The second is we have maxed
>>> out the CPU for the federation clients on the downstream (SSL is enabled,
>>> I'm not sure how much CPU overhead that adds).
>>> For the CPU issue, I figured the newly added federated queues would be a
>>> perfect solution to the problem.  I can setup additional Rabbits on the
>>> downstream side, setup the federation links, and have everything load
>>> balance nicely.  The only thing it doesn't address is the max bandwidth for
>>> a single TCP connection.  Because of our initial design, we would run into
>>> max bandwidth problems for each queue.
>>> Our current objective is to be able to send 20k/sec messages from each
>>> datacenter for a single application.  In China, the most we can do is
>>> around 2.5k/sec (which ends up being around 1.6 MB/sec on a good day).
>>> Because this message load will be from a single application, with the
>>> current design, it will be tied to a single routing key.  So for China, I
>>> would need around 8 TCP connections to do this.
>>> For this use case, message order doesn't matter.  Does anyone have any
>>> ideas on how I can setup multiple federation links that will be load
>>> balanced?  Here are some ideas I have, but they all feel hacky.
>>> 1) On the upstreams, use a consistent hash exchange, with
>>> exchange-to-exchange bindings to 8 direct exchanges, which would be federated.
>>> 2) Run multiple instances of RMQ on the downstream machines, and use
>>> federated queues.  Total number of instances across all machines should be
>>> greater than 8.
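A quick sanity check on the numbers above: at roughly 2.5k msg/sec per China-to-US TCP connection, reaching 20k msg/sec needs 8 parallel links, and since ordering doesn't matter, any splitter that spreads messages roughly evenly will do. A toy sketch (the rates come from the post; everything else is made up):

```python
import math
from itertools import cycle

PER_LINK_RATE = 2_500   # msg/sec one China->US TCP connection sustains
TARGET_RATE = 20_000    # msg/sec required for the single application

n_links = math.ceil(TARGET_RATE / PER_LINK_RATE)
print(n_links)  # 8

# When order doesn't matter, a round-robin split across the links is enough.
links = [[] for _ in range(n_links)]
assign = cycle(range(n_links))
for msg_id in range(80_000):
    links[next(assign)].append(msg_id)

per_link = {i: len(batch) for i, batch in enumerate(links)}
print(per_link)  # 10_000 messages on each of the 8 links
```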
>>> My apologies in advance if I'm missing something obvious.  Please let me
>>> know if I'm trying to fit a round peg in a square hole.  :)
>>> Thanks!
>>> -Dustin
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq-discuss at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
