[rabbitmq-discuss] Performance on ec2

Wed Dec 28 17:49:52 GMT 2011

Hi

We've been running rabbitmq in production for some time now and we are generally very happy with the way it's performing. As it happens, we're about to start scaling our current mq setup violently, going from today's stable throughput of about 2500 messages per second to roughly 20 times that (about 50 000/s) in the first step. Ideally, we wouldn't stop until we hit 100 000/s. [*1]

Now. Our servers live in the Amazon cloud, our messages are on the order of 800 bytes, and we use multiple mq servers in a cluster. At the moment we publish messages as durable, but ideally we are aiming for a high-availability setup with (at least) 4 ec2 instances in two different availability zones.
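For a sense of scale, here is a rough back-of-the-envelope sketch of the payload bandwidth these rates imply. It assumes exactly 800 bytes per message and counts payload only (no AMQP framing, headers, acks or cluster replication traffic); the function name is made up for illustration.

```python
# Payload-only bandwidth estimate. Assumption: every message is exactly
# 800 bytes; protocol overhead and replication traffic are ignored.
MESSAGE_BYTES = 800


def payload_mb_per_s(messages_per_s, message_bytes=MESSAGE_BYTES):
    """Return payload throughput in megabytes (10**6 bytes) per second."""
    return messages_per_s * message_bytes / 1_000_000


if __name__ == "__main__":
    for label, rate in [("today", 2_500),
                        ("first step", 50_000),
                        ("target", 100_000)]:
        mb = payload_mb_per_s(rate)
        # 1 byte = 8 bits, so MB/s * 8 gives Mbit/s.
        print(f"{label:>10}: {rate:>7} msg/s = {mb:.0f} MB/s = {mb * 8:.0f} Mbit/s")
```

So even the 100 000/s target is on the order of 80 MB/s of payload, before any overhead.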

So... I did some tests. Testing was performed on:
	* my macbook pro 13 (2.7.0 R14B04)
	* 2 x1.large instances (2 cores, 8-ish GB) (2.6.1 R14B02 | 2.6.32-309-ec2 #18-Ubuntu SMP)
	* 2 c1.xlarge instances (8 cores, 8-ish GB) (2.7.0 R14B02 | 2.6.35.14-103.47.amzn1.x86_64 #1 SMP)

Here are the graphs. http://goo.gl/z3tgG . Everyone loves graphs. The rabbitmq server versions were 2.6.1 and 2.7.0. 

The loader process would load (publish, if you like) 1M messages at about 800 bytes apiece and route them via a direct exchange, using one of 360 routing keys, to 24 queues. The drainer process would connect to all 24 queues and drain messages off them with a prefetch of 500, acking every 200th message (with the multiple ack flag set).
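The drainer's ack pattern can be sketched roughly as follows. This is a client-agnostic illustration, not the actual test code: `drain` and the `ack` callback are hypothetical names, the callback stands in for whatever basic.ack call the client library provides, and the final flush of a partial batch is an assumption (the description above does not say what happens to the remainder).

```python
from typing import Callable, Iterable


def drain(delivery_tags: Iterable[int],
          ack: Callable[[int, bool], None],
          ack_every: int = 200) -> int:
    """Consume deliveries, acking every `ack_every`-th one with multiple=True.

    An ack with the multiple flag set acknowledges all outstanding
    deliveries up to and including the given tag, so one ack covers
    the whole batch.
    """
    total = 0
    unacked = 0
    last_tag = None
    for tag in delivery_tags:
        total += 1
        unacked += 1
        last_tag = tag
        if unacked == ack_every:
            ack(tag, True)  # multiple=True
            unacked = 0
    # Assumption: flush a trailing partial batch so nothing stays unacked.
    if unacked:
        ack(last_tag, True)
    return total


if __name__ == "__main__":
    sent = []
    drain(range(1, 451), lambda tag, multiple: sent.append(tag))
    print(sent)
```

With a prefetch of 500 and an ack every 200 messages, the broker may keep sending while up to 500 deliveries are unacknowledged, so the consumer is not stalled waiting on its own acks.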

The drain rate is what concerns me the most, so I'm going to focus on that.

This is what I came away with:
	* I can only ever hope for a drain rate of 8000 messages per second, and that's when there are no incoming messages.
	* There's no real difference between using durable and non-durable messages.
	* On the ec2 instances, I got only a minor speed increase when moving from 2 core machines to 8 core machines.
	* No real difference whether a drain instance connects to all 24 queues or to a subset (i.e. 3 instances connecting to queues 0-7, 8-15, 16-23 respectively).
	* No significant speed improvements from 2.6.1 to 2.7.0.
	* Loading at max rate (25k/s) pretty much blocks draining (I got <100/s until the first loading instance had finished (!)).
	* Serving 8000 messages per second causes 500% cpu load on the first mq server. No significant load on the drainer instances themselves.

My question is as simple as it is complex: 'Is that it?' What can I do to tweak these numbers? Massively. 

Kind regards,
Srdan
burtcorp.com

[*1] Yes, we are a highly reasonable and humble bunch.