[rabbitmq-discuss] Design for global data ingestion

Laing, Michael michael.laing at nytimes.com
Wed Nov 13 14:12:05 GMT 2013


We found the AWS Tokyo region to be the best transit point to/from China,
although that was a while ago.

Tokyo has 3 availability zones, which is now the primary reason we base clusters there.

Singapore should be good too.

ml


On Wed, Nov 13, 2013 at 8:33 AM, Alvaro Videla <videlalvaro at gmail.com> wrote:

> Hi,
>
> I've been trying to set up something like this, more as an
> experiment/research to see how far federation could be taken across AWS
> availability zones, but I have to admit that I've been sidetracked by
> various projects and conferences.
>
> When I was living in China, I remember that a company there had a
> problem similar to yours when sending data to the US.
>
> Considering that you have upstreams at various locations in the world,
> could it be the case that you have better TCP bandwidth from China to
> another non-US location, say HK, Japan, Singapore, or Korea?
>
> If that's the case you could perhaps try to form a different federation
> graph that would allow you to achieve some max flow routing. Say publishing
> from China to a country that has better bandwidth than from China to the
> US, and then from that country to the US, or to a third country and so on.
> I know this might sound nice theoretically but I haven't tried it. I throw
> this out nonetheless, in case it might lead you to a solution.
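A rough way to reason about the relay idea above: a two-hop route is only as fast as its slower hop, so the question is whether any relay's bottleneck beats the direct link. The sketch below assumes that model. The 2,500 msg/sec direct figure is the China-to-US rate Dustin reports; the relay names and their per-hop rates are made-up placeholders, not measurements, and `best_route` is a hypothetical helper.

```python
def best_route(direct_rate, relays):
    """Pick the route with the highest bottleneck rate.

    direct_rate: msgs/sec on the direct link.
    relays: {name: (rate_to_relay, rate_from_relay)} in msgs/sec.
    """
    best_name, best_rate = "direct", direct_rate
    for name, (first_hop, second_hop) in relays.items():
        # A relayed route can go no faster than its slower hop.
        bottleneck = min(first_hop, second_hop)
        if bottleneck > best_rate:
            best_name, best_rate = name, bottleneck
    return best_name, best_rate


# Placeholder numbers for illustration only; measure your own links.
relays = {
    "tokyo": (6000, 9000),
    "singapore": (5000, 12000),
}
route, rate = best_route(2500, relays)
print(route, rate)
```

With real measurements plugged in, the same comparison would tell you whether adding a hop is worth the extra latency and operational cost.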
>
> Regards,
>
> Alvaro
>
> On Thu, Nov 7, 2013 at 6:47 PM, Dustin <dustink.ml at gmail.com> wrote:
>
>> Hello All!
>>
>> I wanted to shoot this out to see if anyone has had any experience with
>> using RabbitMQ for a mass global data ingestion pipeline.  A small
>> disclaimer, I'm a total RMQ noob :)
>>
>> We currently have a fan-in design, where we have a single downstream
>> 2-node HA cluster in the same data center as our data warehouse.  We have
>> around 22 upstreams (also 2-node HA clusters) located in datacenters all
>> over the world.  The configuration is extremely simple.  We have a single
>> direct exchange, which everything publishes to.  Each application uses a
>> specified routing key for that application.  We end up with a queue per
>> application (currently around 10).  We are running 3.0.0 on the downstream
>> cluster (been waiting for a maintenance window to upgrade) and 3.1.5 on the
>> upstreams.
>>
>> This design has held up well, and we are averaging around 20k
>> messages/sec over the course of a day.
>>
>> We have run into 2 problems which won't allow us to scale any further.
>> The first is the max bandwidth for a single TCP connection across the
>> globe (specifically between the US and China).  The second is we have maxed
>> out the CPU for the federation clients on the downstream (SSL is enabled,
>> I'm not sure how much CPU overhead that adds).
>>
>> For the CPU issue, I figured the newly added federated queues would be a
>> perfect solution to the problem.  I can set up additional Rabbits on the
>> downstream side, set up the federation links, and have everything load
>> balance nicely.  The only thing it doesn't address is the max bandwidth for
>> a single TCP connection.  Because of our initial design, we would run into
>> max bandwidth problems for each queue.
>>
>> Our current objective is to be able to send 20k messages/sec from each
>> datacenter for a single application.  In China, the most we can do is
>> around 2.5k/sec (ends up being around 1.6MB/sec, this is on a good day).
>> Because this message load will be from a single application, with the
>> current design, it will be tied to a single routing key.  So for China, I
>> would need around 8 TCP connections to do this.
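The figure of 8 connections follows directly from the numbers quoted above; here is the arithmetic spelled out as a small sketch. It assumes each extra TCP connection gets the same 2.5k msg/sec as the first (which may not hold if the links share a bottleneck) and that 1.6MB means 1.6 x 10^6 bytes. `links_needed` is just an illustrative helper name.

```python
import math

TARGET_RATE = 20_000      # msgs/sec wanted per datacenter (from the post)
PER_LINK_RATE = 2_500     # msgs/sec observed on one China link (from the post)
PER_LINK_BYTES = 1.6e6    # bytes/sec on that link, assuming MB = 10**6 bytes


def links_needed(target_rate, per_link_rate):
    """Number of parallel links needed, assuming throughput adds up linearly."""
    return math.ceil(target_rate / per_link_rate)


links = links_needed(TARGET_RATE, PER_LINK_RATE)
avg_msg_bytes = PER_LINK_BYTES / PER_LINK_RATE      # implied average message size
target_bytes = TARGET_RATE * avg_msg_bytes          # bandwidth the target implies

print(links, avg_msg_bytes, target_bytes)
```

So the target works out to roughly 640-byte messages and about 12.8MB/sec in aggregate, with no headroom; in practice you would likely want more than 8 links to absorb bad days.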
>>
>> For this use case, message order doesn't matter.  Does anyone have any
>> ideas on how I can set up multiple federation links that will be load
>> balanced?  Here are some ideas I have, but they all feel hacky.
>>
>> 1) On the upstreams, use a consistent hash exchange, with exchange to
>> exchange bindings to 8 direct exchanges, which would be federated.
>> 2) Run multiple instances of RMQ on the downstream machines, and use
>> federated queues.  Total number of instances across all machines should be
>> greater than 8.
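One thing to watch with idea 1: the consistent hash exchange picks a destination by hashing the routing key, so if every message from an application carries the same key, they all land on the same federated exchange and you are back to one link. The sketch below illustrates that with a plain modulo hash over 8 shards. It is a stand-in for illustration, not RabbitMQ's actual hash ring, and the `app-a` key and `shard_for` helper are hypothetical.

```python
import hashlib
from collections import Counter

NUM_LINKS = 8  # one shard per federated exchange / TCP connection


def shard_for(routing_key, shards=NUM_LINKS):
    """Map a routing key to a shard with a simple modulo hash (illustrative only)."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


# Same routing key on every message: everything hashes to one shard.
fixed = Counter(shard_for("app-a") for _ in range(10_000))

# Per-message suffix on the key (e.g. a sequence number): load spreads out.
varied = Counter(shard_for(f"app-a.{i}") for i in range(10_000))

print(len(fixed), len(varied))
```

If you go this route, the publishers would need to put something that varies per message into the routing key, and the downstream bindings would have to cope with that.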
>>
>> My apologies in advance if I'm missing something obvious.  Please let me
>> know if I'm trying to fit a round peg in a square hole.  :)
>>
>> Thanks!
>>
>> -Dustin
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>>
>