[rabbitmq-discuss] Design for global data ingestion

Dustin dustink.ml at gmail.com
Wed Nov 13 19:40:13 GMT 2013

Thanks a ton everyone for the feedback and suggestions.  Unfortunately our
data warehouse is in the US, so I wouldn't be able to use one of our other
datacenters as the final destination.

This really helps clear a lot of things up, Simon. Now that I know a queue
can only use one CPU, I was able to quadruple my throughput in my load
testing by using the random exchange bound to 4 queues.  Unfortunately I
can't use the consistent hash exchange, as our routing keys aren't very
random, and our developers don't currently have time to write a random
value into a header that I could hash on.  I would have preferred the
consistent hash exchange, as it's included in the base install.
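The skew from non-random routing keys is easy to demonstrate offline. A minimal sketch (the bucket count and key names are made up, and MD5 stands in for whatever hash function the exchange actually uses):

```python
import hashlib
from collections import Counter

def bucket(key: str, n_buckets: int) -> int:
    """Map a routing key to one of n_buckets via a stable hash (MD5 here)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# With only a couple of distinct routing keys (one per application),
# at most that many of the 4 queues can ever receive traffic.
app_keys = ["app.metrics", "app.logs"]  # hypothetical per-app keys
used = {bucket(k, 4) for k in app_keys}
print(f"distinct buckets hit by 2 fixed keys: {len(used)}")  # at most 2

# With a random value per message, traffic spreads across all 4 queues.
random_keys = [f"msg-{i}" for i in range(10_000)]
counts = Counter(bucket(k, 4) for k in random_keys)
print(dict(counts))  # roughly 2500 per bucket
```

That is why a per-message random header (or the random exchange) balances where a hash of a handful of fixed keys cannot.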

I didn't know I could define multiple upstreams with the same URI.  This
should solve the problem at hand.
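For anyone else hitting the single-connection cap: each upstream definition gets its own federation link (its own TCP connection), so several upstreams pointing at the same URI give you parallel connections. A minimal sketch of the parameter documents (the upstream names, URI, and set name are all made up; each JSON value would be fed to `rabbitmqctl set_parameter`):

```python
import json

UPSTREAM_URI = "amqps://feeder.example.com"  # hypothetical upstream broker

# Several upstream definitions with the same URI: the federation plugin
# opens one link per upstream, so N upstreams means N parallel
# TCP connections to the same broker.
upstreams = {f"china-link-{i}": {"uri": UPSTREAM_URI} for i in range(1, 5)}

for name, definition in upstreams.items():
    print(f"rabbitmqctl set_parameter federation-upstream {name} "
          f"'{json.dumps(definition)}'")

# An upstream-set groups the links so one policy can federate over all of them.
upstream_set = [{"upstream": name} for name in upstreams]
print("rabbitmqctl set_parameter federation-upstream-set china-links "
      f"'{json.dumps(upstream_set)}'")
```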

Thanks again!


On Wed, Nov 13, 2013 at 6:12 AM, Laing, Michael
<michael.laing at nytimes.com> wrote:

> We found the AWS Tokyo region to be the best transit point to/from China,
> although that was a while ago.
> Tokyo has 3 zones which now is the primary reason we base clusters there.
> Singapore should be good too.
> ml
> On Wed, Nov 13, 2013 at 8:33 AM, Alvaro Videla <videlalvaro at gmail.com> wrote:
>> Hi,
>> I've been trying to set up something like this, more of an
>> experiment/research project to see how far federation can be taken across
>> AWS availability zones, but I have to admit that I've been sidetracked by
>> various projects and conferences.
>> When I was living in China, I remember a company there had a similar
>> problem to yours when sending data to the US.
>> Considering that you have upstreams at various locations in the world,
>> could it be the case that you have better TCP bandwidth from China to
>> another non US location, say HK, Japan, Singapore, or Korea?
>> If that's the case you could perhaps try to form a different federation
>> graph that would allow you to achieve some max flow routing. Say publishing
>> from China to a country that has better bandwidth than from China to the
>> US, and then from that country to the US, or to a third country and so on.
>> I know this might sound nice in theory, but I haven't tried it. I'm
>> throwing it out nonetheless, in case it leads you to a solution.
>> Regards,
>> Alvaro
>> On Thu, Nov 7, 2013 at 6:47 PM, Dustin <dustink.ml at gmail.com> wrote:
>>> Hello All!
>>> I wanted to shoot this out to see if anyone has had any experience with
>>> using RabbitMQ for a mass global data ingestion pipeline.  A small
>>> disclaimer, I'm a total RMQ noob :)
>>> We currently have a fan-in design, where we have a single downstream 2
>>> node HA cluster in the same data center as our data warehouse.  We have
>>> around 22 upstreams (also 2 node HA clusters) located in datacenters all
>>> over the world.  The configuration is extremely simple.  We have a single
>>> direct exchange, which everything publishes to. Each application uses a
>>> specified routing key for that application.  We end up with a queue per
>>> application (currently around 10).  We are running 3.0.0 on the downstream
>>> cluster (been waiting for a maintenance window to upgrade) and 3.1.5 on the
>>> upstreams.
>>> This design has held up well, and we are averaging around 20k
>>> messages/sec over the course of a day.
>>> We have run into 2 problems which won't allow us to scale any further.
>>>  The first is the max bandwidth for a single TCP connection across the
>>> globe (specifically between the US and China).  The second is we have maxed
>>> out the CPU for the federation clients on the downstream (SSL is enabled,
>>> I'm not sure how much CPU overhead that adds).
>>> For the CPU issue, I figured the newly added federated queues would be a
>>> perfect solution to the problem.  I can setup additional Rabbits on the
>>> downstream side, setup the federation links, and have everything load
>>> balance nicely.  The only thing it doesn't address is the max bandwidth for
>>> a single TCP connection.  Because of our initial design, we would run into
>>> max bandwidth problems for each queue.
>>> Our current objective is to be able to send 20k/sec messages from each
>>> datacenter for a single application.  In China, the most we can do is
>>> around 2.5k/sec (which ends up being around 1.6 MB/sec on a good day).
>>> Because this message load will be from a single application, with the
>>> current design, it will be tied to a single routing key.  So for China, I
>>> would need around 8 TCP connections to do this.
>>> For this use case, message order doesn't matter.  Does anyone have any
>>> ideas on how I can setup multiple federation links that will be load
>>> balanced?  Here are some ideas I have, but they all feel hacky.
>>> 1) On the upstreams, use a consistent hash exchange, with
>>> exchange-to-exchange bindings to 8 direct exchanges, which would be federated.
>>> 2) Run multiple instances of RMQ on the downstream machines, and use
>>> federated queues.  Total number of instances across all machines should be
>>> greater than 8.
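A quick sanity check on the numbers above: at roughly 2.5k msg/sec per China-to-US TCP connection, reaching 20k msg/sec needs 8 parallel links, and since ordering doesn't matter, any splitter that spreads messages roughly evenly will do. A toy sketch (the rates come from the post; everything else is made up):

```python
import math
from itertools import cycle

PER_LINK_RATE = 2_500   # msg/sec one China->US TCP connection sustains
TARGET_RATE = 20_000    # msg/sec required for the single application

n_links = math.ceil(TARGET_RATE / PER_LINK_RATE)
print(n_links)  # 8

# When order doesn't matter, a round-robin split across the links is enough.
links = [[] for _ in range(n_links)]
assign = cycle(range(n_links))
for msg_id in range(80_000):
    links[next(assign)].append(msg_id)

per_link = {i: len(batch) for i, batch in enumerate(links)}
print(per_link)  # 10_000 messages on each of the 8 links
```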
>>> My apologies in advance if I'm missing something obvious.  Please let me
>>> know if I'm trying to fit a round peg in a square hole.  :)
>>> Thanks!
>>> -Dustin
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq-discuss at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
