[rabbitmq-discuss] Looking for guidance on R14B04 vs. R13B03 performance

Wed Feb 15 17:36:41 GMT 2012

On 15/02/12 17:15, Matt Pietrek wrote:
> Thanks Simon. The 5% figure is useful for me.

Cheers. Bear in mind that's just my guess, and it also depends on 
your workload being CPU-bound, which it sounds like it isn't.

> Let me give you a more precise description of what I'm doing to get the
> 36 messages/sec.
>
>     * RabbitMQ 2.7.1 on a 3-node cluster with mirrored queues, durable on
>       all nodes.
>     * Client is Python 2.6/Pika 0.9.5.
>     * Each message publish occurs in a transaction so that we can be
>       sure it's safely in RabbitMQ.
>     * All nodes are Ubuntu 10.04 VMs with 4GB RAM and 2 or 4 vCPUs.
>
>
> At the heart of things, we're driving a highly complex state machine
> that manages thousands of VMs and their associated state. Losing track
> of any state is prohibitively expensive to clean up manually. As such,
> all state is modeled in clustered databases and/or persistent messages
> in the message queue. We have to assume that a given client app instance
> (our management code) may be ungracefully terminated at any moment, so
> enough state must be modeled to let a new instance pick up and recover.
> If our database record indicates that a message has been sent, it better
> darn well be in the hands of the RabbitMQ broker, and not sitting in
> some Pika client-side queue.
>
> For these reasons, publisher-confirms are not particularly helpful - they
> assume that the client app will be around to resend the message if the
> message doesn't get confirmed. Similar story for batching messages. We
> have to know they've been sent, and

OK.

> we can't stall our state machine
> waiting for enough messages to accumulate to publish multiple messages at
> once.

So I think this is really the key point. OK, I accept your use case :)

> My goal in my latest round of experiments is to see what the maximum
> throughput of a highly available system is in optimal circumstances.
> We're perfectly willing to spend the money on high end SSDs and
> networking equipment as necessary.
>
> To prototype what this perf level is, I've configured RabbitMQ with the
> MNESIA directory pointing to a ramdisk (/tmp).

I assume that this is not how you plan to go into production, and that 
you are just testing how fast you can go if disk speed is not an issue?

> I've configured all the
> VMs with VMXNET3 networking and, with 16K blocks, am seeing bandwidth of
> 130MB/sec between VMs in the cluster.

The real issue though is going to be latency from client to server. What 
does that look like?
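
One rough way to put a number on that is to time TCP connects from the client VM to the broker, since a connect costs about one network round trip. This is only a sketch: the function name is made up for illustration, it is written for a current Python rather than the 2.6 mentioned above, and it measures the network path only, not the time the broker spends handling a tx.commit.

```python
import socket
import time


def tcp_connect_latency_ms(host, port, samples=20, timeout=2.0):
    """Return sorted TCP connect times to (host, port), in milliseconds.

    Hypothetical helper: a TCP connect takes roughly one network round
    trip, so this approximates client -> server latency. It does not
    include any broker-side processing.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open and immediately close a connection; only the connect is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        results.append((time.perf_counter() - start) * 1000.0)
    return sorted(results)
```

Calling it with the broker's hostname and the AMQP port (5672 by default) and looking at the median of the returned list should give a usable first estimate.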

> My test app simply writes 6-byte messages, one at a time, as quickly as
> it can. In monitoring the cluster nodes, I'm seeing very low CPU usage,
> very few writes to the physical disk, and network operation rates of
> about 700/sec for the master node and 350/sec for the client node.
>
> In short, there's a bottleneck somewhere and it's not obvious where.
> I'll try your suggestion about replacing tx.commit. Any other insight or
> guidance would of course be very much appreciated. :-)

It sounds like you have already almost eliminated disk writes in 
practice, so my working assumption would be that it's client -> server 
latency that is your issue.
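
To make the arithmetic behind that concrete: if each publish has to wait for its commit before the next one starts, throughput is capped at one message per publish-plus-commit cycle. The sketch below just inverts the numbers. The function names are invented for illustration, the only figure taken from this thread is the 36 messages/sec, and the cycle time it yields covers everything in the cycle (network round trips plus whatever the cluster does on commit), not network latency alone.

```python
def per_message_cycle_ms(messages_per_sec):
    """Time each message takes when messages are sent strictly one at a time."""
    return 1000.0 / messages_per_sec


def max_sync_throughput(cycle_ms):
    """Upper bound on messages/sec for a given publish+commit cycle time."""
    return 1000.0 / cycle_ms


# 36 messages/sec (the rate reported above) means about 27.8 ms per cycle.
observed_cycle = per_message_cycle_ms(36)

# Hypothetical target: a 1 ms cycle would cap a single publisher at 1000/sec.
ceiling_at_1ms = max_sync_throughput(1.0)
```

So at 36 messages/sec each message is taking roughly 28 ms end to end, which is the figure to compare against the measured client -> server round trip.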

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, VMware