[rabbitmq-discuss] Getting maximum performance from RabbitMQ cluster

Pavel pmaisenovich at blizzard.com
Mon May 5 04:25:44 BST 2014


I'm running a series of performance tests on clustered RabbitMQ 3.1.5 and would
like to share and validate my test results.

In the current test I'm running single-threaded Java clients that publish
non-persistent messages of a fixed size (~200 bytes) as fast as they can, each
via its own connection/channel, to a direct exchange on a target Rabbit node.
The exchange has multiple queues bound to it, but there is always exactly one
queue matching the routing key used by each publisher. Queues are created on
different nodes in the cluster, so there is a choice of publishing to the
"master" node or to a "slave" node. Because I'm not interested in consumers
(yet) and want to avoid memory-based flow control, all queues are size-bounded
with "x-max-length: 100" (if there are any performance implications of this,
please let me know!). All parameters not mentioned here are left at their
defaults.
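
For reference, each publisher thread boils down to roughly the following (the
exchange/queue/host names are just placeholders for what my test harness
actually uses):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.MessageProperties;

    import java.util.Collections;
    import java.util.Map;

    public class TestPublisher implements Runnable {
        public void run() {
            try {
                ConnectionFactory factory = new ConnectionFactory();
                factory.setHost("rabbit-node-1");   // the node this thread targets

                Connection connection = factory.newConnection();
                Channel channel = connection.createChannel();

                // One queue per thread, bound with a thread-unique routing key.
                String exchange = "perf.direct";
                String routingKey = "perf.key." + Thread.currentThread().getId();
                String queue = "perf.queue." + Thread.currentThread().getId();

                Map<String, Object> args =
                        Collections.<String, Object>singletonMap("x-max-length", 100);

                channel.exchangeDeclare(exchange, "direct");
                channel.queueDeclare(queue, false, false, false, args);
                channel.queueBind(queue, exchange, routingKey);

                byte[] body = new byte[200];        // fixed-size, non-persistent payload

                while (!Thread.currentThread().isInterrupted()) {
                    channel.basicPublish(exchange, routingKey,
                            MessageProperties.MINIMAL_BASIC, body);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }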

I have two Rabbit nodes in the current cluster; each box has 16 CPU cores. I
can provide spec details if needed.

First of all I wanted to get the best publish TPS numbers, maxing out Rabbit
CPU while keeping the cluster stable. For that I set up non-mirrored queues
and configured the publishers to always use the master node (the node where
the destination queue was created). Since each thread always publishes with
its own unique routing key (i.e. to the same queue), the expectation is that
flow control will throttle each thread at a certain rate (about the same
across all threads) and that total throughput will scale linearly with the
thread count until Rabbit runs out of CPU. The tests confirmed this:

1 thread - ~18K/s
2 threads - ~40K/s
4 threads - ~80K/s
8 threads - ~130K/s
16 threads - ~160K/s

Note that throughput reached its ceiling around 8-9 threads, while the total
number of CPU cores is 16. CPU on the Rabbit node was 80-95% busy at 8 threads
(total across all cores, constantly going up and down) and 97-99% busy at 16
publishing threads. In all tests the management UI was showing "flow" for all
publishing connections.

Running the same test in parallel against both nodes shows that (so far)
cluster throughput scales almost linearly with the number of nodes:

4 + 4 threads - ~153K/s
8 + 8 threads - ~260K/s
16 + 16 threads - ~310K/s

To find the point of complete CPU saturation, I've added per-thread publish
rate throttling and tuned it so that flow control almost never kicks in (in
the 1-thread test). Running with that, I got Rabbit CPU constantly 99% busy
at 9 threads with the same ~130K/s total publish rate as in the unthrottled
test. Limiting the publish rate per thread loaded the Rabbit cores more
evenly, but didn't improve the overall throughput, so I'm wondering: are
there any other benefits to explicitly limiting the publish rate instead of
letting per-connection flow control do it?
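
For the record, the throttle itself is nothing fancy - just a sleep-based
pacer around basicPublish, roughly like this (a drop-in for the publish loop
in the sketch above; the method name and rate value are mine):

    // Drop-in replacement for the unthrottled publish loop in the earlier sketch:
    // spreads publishes evenly so this thread sends roughly ratePerSec msg/s.
    // Needs java.util.concurrent.TimeUnit in addition to the earlier imports.
    static void publishThrottled(Channel channel, String exchange, String routingKey,
                                 byte[] body, long ratePerSec) throws Exception {
        long intervalNanos = TimeUnit.SECONDS.toNanos(1) / ratePerSec;
        long next = System.nanoTime();

        while (!Thread.currentThread().isInterrupted()) {
            channel.basicPublish(exchange, routingKey,
                    MessageProperties.MINIMAL_BASIC, body);

            next += intervalNanos;
            long sleepNanos = next - System.nanoTime();
            if (sleepNanos > 0) {
                TimeUnit.NANOSECONDS.sleep(sleepNanos);
            }
        }
    }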

Next I've tried publishing to a slave node (a member of the cluster which
doesn't host the non-mirrored queue). In these tests the throughput only
scales up to ~20K/s and remains roughly the same from 2 threads and above
(with the slave node running at ~13-20% CPU and the master at ~4-7%). It
looks like flow control in this case is shared between all threads publishing
to the slave node (which share the slave-to-master connection used to deliver
the messages). Running parallel threads publishing to the master confirms
that those are throttled independently, at the same rate as in the baseline
test. This suggests that for best performance, publishers must be aware of
the queue's master node and use it at all times. That seems non-trivial,
given that publishers are usually only aware of the exchange and routing key,
while queues could be redeclared by clients at runtime on any node in the
cluster (in case of a node outage). Is there any good reading on how to
address this problem?
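
The only workaround I can think of so far is asking the management plugin
which node a queue currently lives on and connecting there - a rough sketch
below (it assumes the default guest/guest credentials, the default vhost and
management port 15672, and the JSON "parsing" is deliberately crude):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import javax.xml.bind.DatatypeConverter;

    // Ask the management API which node hosts a given queue, so a publisher
    // can connect directly to it. Host, port and credentials are placeholders.
    public class QueueLocator {

        public static String nodeFor(String mgmtHost, String queue) throws Exception {
            // %2f is the default vhost "/" URL-encoded
            URL url = new URL("http://" + mgmtHost + ":15672/api/queues/%2f/" + queue);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = DatatypeConverter.printBase64Binary("guest:guest".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            StringBuilder json = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                json.append(line);
            }
            in.close();

            // Crudely pull out the "node" field, e.g. "rabbit@host1".
            // A real client would use a proper JSON library here.
            Matcher m = Pattern.compile("\"node\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
            return m.find() ? m.group(1) : null;
        }
    }

Even with that, a publisher would have to re-check after a failover, so
pointers to a better approach are very welcome.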

Next I've tried publishing to HA queues (set_policy ha-all). This limited
the throughput and horizontal scaling even further - max throughput went down
to ~8-9K/s, achieved with 1 thread, and remained the same with more threads
added. Removing the HA policy from selected queues during the test brings the
affected publishers back to their baseline rate within ~3 seconds, while the
others remain heavily throttled. This suggests that all HA queues share a
flow control threshold between all of their publishers - is this correct?
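
For completeness: the policy add/remove in these tests is just rabbitmqctl
set_policy / clear_policy, but the same thing can be scripted over the
management HTTP API, roughly like this (same placeholder host/credentials as
in the earlier sketch; using one policy per test queue is just one way to do
the "remove HA on selected queues" step):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.bind.DatatypeConverter;

    // Set/clear an "ha-mode: all" policy for a single test queue via the
    // management HTTP API. Host and credentials are the same placeholders
    // as in the earlier sketch.
    public class HaPolicyToggle {

        private static HttpURLConnection open(String mgmtHost, String policyName,
                                              String method) throws Exception {
            URL url = new URL("http://" + mgmtHost + ":15672/api/policies/%2f/" + policyName);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod(method);
            String auth = DatatypeConverter.printBase64Binary("guest:guest".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);
            return conn;
        }

        // Mirror exactly one queue by using its name as the policy pattern.
        public static void mirrorQueue(String mgmtHost, String queue) throws Exception {
            HttpURLConnection conn = open(mgmtHost, "ha-" + queue, "PUT");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            String body = "{\"pattern\":\"^" + queue + "$\",\"definition\":{\"ha-mode\":\"all\"}}";
            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();
            conn.getResponseCode();   // expect 2xx
        }

        // Drop the policy again; the affected publishers recover shortly after.
        public static void unmirrorQueue(String mgmtHost, String queue) throws Exception {
            open(mgmtHost, "ha-" + queue, "DELETE").getResponseCode();
        }
    }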

Furthermore, I've noticed in my tests that the rate at which HA publishers
are impacted by non-HA publishers depends on which node they are pointed at.
This was consistent across multiple retries, but I'm not sure if it's
intended or a bug. Below is the best description of this test case I have so
far:

    1) Running 2 publishers to HA queues and 2 publishers to non-HA queues
(all publishing to the same node, all queues are different and 'owned' by
this node):
     - non-HA queues throughput is the same as baseline (~16K/s per queue)
     - HA queues throughput is throttled based on what looks like a shared
HA threshold (~4.5K/s per thread)
 
    2) Running 2 publishers to HA queues on node 1 and 2 publishers to
non-HA queues on node 2 (all queues are different and 'owned' by the node
being published to):
     - non-HA queues throughput is the same as baseline (~16K/s per queue)
     - HA queues throughput is throttled as if there were 4 threads
publishing to HA queues! (~2K/s) Flow control bug?

It's obvious that HA queues are not good for high performance/scalability
(currently we run all of our production queues under an HA policy). I'm going
to add consumers next.




