[rabbitmq-discuss] Looking for guidance on R14B04 vs. R13B03 performance

Wed Feb 15 17:15:44 GMT 2012

Thanks Simon. The 5% figure is useful for me.

Let me give you a more precise description of what I'm doing to get the 36
messages/sec.

   - RabbitMQ 2.7.1 on a 3-node cluster with mirrored queues, durable on all
   nodes.
   - Client is Python 2.6/Pika 0.9.5.
   - Each message publish occurs in a transaction so that we can be sure
   it's safely in RabbitMQ.
   - All nodes are Ubuntu 10.04 VMs with 4GB RAM and 2 or 4 vCPUs.
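
To make the publish pattern above concrete, here is a minimal sketch of one
message per transaction. It is an illustration under assumptions, not the
actual application code: the class name TransactionalPublisher is
hypothetical, and `channel` is assumed to be a blocking Pika channel (or
anything exposing tx_select, basic_publish and tx_commit).

```python
class TransactionalPublisher(object):
    """Publish each message in its own AMQP transaction (hypothetical helper)."""

    def __init__(self, channel):
        # Assumption: `channel` behaves like a blocking Pika channel.
        self._channel = channel
        # tx.select is sent once; the channel stays transactional afterwards.
        self._channel.tx_select()

    def publish(self, routing_key, body, exchange="", properties=None):
        # The publish itself is asynchronous at the protocol level...
        self._channel.basic_publish(exchange=exchange,
                                    routing_key=routing_key,
                                    body=body,
                                    properties=properties)
        # ...so block on tx.commit. Returning without an exception means the
        # broker has acknowledged the commit for this one message.
        self._channel.tx_commit()
```

For persistent messages, `properties` would typically be something like
pika.BasicProperties(delivery_mode=2). Every call to publish() costs one
synchronous round trip to the broker.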

At the heart of things, we're driving a highly complex state machine that
manages thousands of VMs and their associated state. Losing track of any
state is prohibitively expensive to clean up manually. As such, all state
is modeled in clustered databases and/or persistent messages in the message
queue. We have to assume that a given client app instance (our management
code) may be ungracefully terminated at any moment, so enough state must be
modeled to let a new instance pick up and recover. If our database record
indicates that a message has been sent, it better darn well be in the hands
of the RabbitMQ broker, and not sitting in some Pika client-side queue.

For this reason, publisher confirms are not particularly helpful: they
assume that the client app will be around to resend the message if the
message doesn't get confirmed. Similar story for batching messages. We have
to know they've been sent, and we can't stall our state machine waiting for
enough messages to accumulate to publish multiple messages at once.

My goal in my latest round of experiments is to see what the maximum
throughput of a highly available system is in optimal circumstances. We're
perfectly willing to spend the money on high end SSDs and networking
equipment as necessary.

To prototype what this perf level is, I've configured RabbitMQ with the
MNESIA directory pointing to a ramdisk (/tmp). I've configured all the VMs
with VMXNET3 networking and, with 16K blocks, am seeing bandwidth of
130MB/sec between VMs in the cluster.

My test app simply writes a 6-byte message, one at a time, as quickly as it
can. In monitoring the cluster nodes, I'm seeing very low CPU usage, very
few writes to the physical disk, and network operation rates of about
700/sec for the master node and 350/sec for the client node.
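
A quick back-of-envelope check on these figures, as a rough sketch only: it
assumes the observed rates are steady averages and that all counted network
operations belong to the test traffic. The function name is hypothetical.

```python
def per_message_budget(msgs_per_sec, net_ops_per_sec):
    """Return (milliseconds per message, network operations per message)."""
    latency_ms = 1000.0 / msgs_per_sec
    ops_per_msg = float(net_ops_per_sec) / msgs_per_sec
    return latency_ms, ops_per_msg

# Figures quoted above: 36 msg/s, about 700 network ops/s on the master.
latency_ms, master_ops = per_message_budget(36, 700)
print("%.1f ms per message, %.1f network ops per message on the master"
      % (latency_ms, master_ops))
```

Roughly 28 ms per publish, with CPU and disk mostly idle, is consistent with
the time being spent waiting rather than working.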

In short, there's a bottleneck somewhere and it's not obvious where. I'll
try your suggestion about replacing tx.commit. Any other insight or
guidance would of course be very much appreciated. :-)
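
The experiment Simon describes can be sketched as a small timing harness
that runs the same publish loop with two different synchronous calls. This
is only an outline: measure_rate and its parameters are hypothetical names,
and `publish` and `sync` are zero-argument callables supplied by the test
app.

```python
import time

def measure_rate(publish, sync, count=200, clock=time.time):
    """Call publish() then sync() `count` times; return messages per second."""
    start = clock()
    for _ in range(count):
        publish()   # send one message
        sync()      # block on one synchronous request to the broker
    elapsed = clock() - start
    # Guard against a zero interval on very coarse clocks.
    return count / elapsed if elapsed > 0 else float("inf")
```

With a blocking Pika channel, run A would call channel.tx_select() first and
pass channel.tx_commit as `sync`; run B would use a fresh channel without
tx.select and pass something like lambda: channel.basic_qos(prefetch_count=1).
If B is much faster than A, the fsync is the likely cost; if the two are
close, the network round trip is.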

Matt

On Wed, Feb 15, 2012 at 2:08 AM, Simon MacMullen <simon at rabbitmq.com> wrote:

> On 14/02/12 23:07, Matt Pietrek wrote:
>
>> I'm looking to squeeze every last bit of message throughput out of our
>> mirrored queue setup.
>>
>
> Hi Matt.
>
>
>> I max out at 36 messages/sec when writing. (Yes,
>> we use transactions, and we know it's not optimal, but we're valuing
>> high reliability over speed. Our clients and/or servers could go away at
>> any moment, so things like Publisher-acknowledge are nice, but take away
>> from the fundamental disaster hardness.)
>>
>> Anyhow, we're running on Ubuntu 10.04 with 2.7.1 and the default Erlang
>> install. This gives us R13B03 as the Erlang version.
>>
>> Question: Does anybody (and particularly the RabbitMQ folks) have any
>> input on what sort of perf improvements we might get by switching to a
>> newer Erlang Version?
>>
>
> Not much really. We haven't tested extensively but if you got 5% I think
> you would consider yourself lucky.
>
> However, 36 msg/s is *really* slow.
>
> Although mirrored queues are slower than normal ones I would still hope
> you could stick a couple of zeroes on the end of that figure. So I suspect
> you are not actually CPU bound at the server.
>
> I assume when you say "using transactions" you mean publishing a single
> message per transaction? In that case you are incurring the cost of a
> network round trip and an fsync() on each publish, and I suspect that is
> what is killing your performance.
>
> If I had to guess I would suspect the fsync is the worst of the problem.
> Assuming the 36 msg/s is from some test app, you might want to remove the
> tx.select and replace the tx.commit with some meaningless synchronous
> request (basic.qos is a favourite) and see what happens to your message
> rate. If it shoots up, the fsync is the problem. If it doesn't, the network
> round trip is.
>
> When you have the answer to that, you know whether you're going to want to
> spend money on fancy networking hardware or SSDs...
>
> Of course, *really* I would suggest you need to be publishing more
> messages per round-trip / fsync. So either batch multiple publishes into a
> transaction or start using confirms instead. I know you say that they "take
> away from the fundamental disaster hardness" but I'm not sure I understand
> what you mean by that. Both confirms and tx give you a way to know when a
> given message is definitely on disc. And in fact tx is implemented in terms
> of confirms these days.
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, VMware