[rabbitmq-discuss] Throughput observation with RabbitMQ-3.1.3 and Erlang R16B01 - Single Node and Cluster Node

Thu Aug 1 20:06:03 BST 2013

Hi there,

I have run a number of small tests to understand how the throughput of a 
RabbitMQ node scales with the number of cores and with the number of 
producers and consumers. Here are my observations. The tests cover both 
single-node and cluster configurations. 

Based on these observations, I would first like to confirm whether these 
are the expected results or whether I can do something more to improve 
throughput. Secondly, I have some specific questions about them, and 
clarification on those would help me continue further. 

Just so you know, I have also checked the performance statistics that 
were published on the RabbitMQ site for version 2.8.1, and based on the 
points there I tried different things: changing prefetch_count values, 
non-persistent as well as persistent messages, DISK and RAM nodes, etc.

My test configuration is as follows.

RabbitMQ version 3.1.3 with Erlang version R16B01

I have one virtual machine with 8 GB of RAM and 20 cores. This is 
dedicated mainly to the rabbit nodes. 
I have another virtual machine with 8 GB of RAM and 20 cores. This is 
dedicated mainly to my producers and consumers. My producers and consumers 
are single-threaded (using the Python pika library), so I tried to go up to 
10 producers and 10 consumers, giving each of them 1 dedicated core, to find 
the system limit of RabbitMQ. I scale up linearly, starting with 1P and 1C 
and so on.

Since my interest is to benchmark performance and find the system limits of 
RabbitMQ, I have simulated the producers and consumers, and there is no 
processing of messages after the consumer receives them. This means, 
I believe, that I am producing as fast as the above VM configuration 
supports and consuming as fast as I can.

I am using pika's non-blocking connection adapter (SelectConnection).
Message size is 100 bytes. I have configured a 'direct' exchange.
I have also enabled publisher confirms as well as consumer acks, since I am 
interested in reliable delivery and confirmation of messages up to the 
application layer. 

Here are my statistics.

Test-1) With single-node configuration:
    The maximum throughput I can get is around 5000 msg/sec, with 1 
    publisher and 1 consumer and prefetch_count = 0.
    Node type = Disk
    Both producer and consumer are given dedicated cores using the Linux 
    taskset command (if I leave core assignment to Linux, throughput is 
    only around 3500 msg/sec). Core assignment for the rabbit nodes is left 
    to Linux and not touched.
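
For readers who want to do the pinning from inside the script rather than with taskset, here is a minimal sketch. It assumes Linux and Python 3.3 or later, where os.sched_setaffinity is available; the helper name pin_to_core is made up for illustration and is not part of the original test scripts.

```python
import os


def pin_to_core(core):
    """Pin the calling process to a single CPU core.

    Returns the resulting affinity set, or None on platforms where
    os.sched_setaffinity does not exist (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # pid 0 means "the calling process"; the core must be one the
    # process is currently allowed to run on, otherwise OSError is raised.
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)
```

Calling pin_to_core(3) at the top of a producer script should have roughly the same effect as starting it with "taskset -c 3", provided core 3 is in the allowed set.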

    Here the limiting factor is the publisher, since it loads its assigned 
    core to almost 100%. So, to get better throughput, I started another 
    producer on another core and also started a corresponding consumer with 
    its own dedicated core. I tried publishing to the same queue as well as 
    to a different queue.
    But the overall throughput stays around the same value, increasing only 
    a little, and it is divided roughly as follows between the producers 
    and consumers. 

    P1 and P2: publish rate roughly 2500-2700 msg/sec each. Same for 
    consumption, which adds up to a total of around 5000-5500 msg/sec.

    Even if I introduce a prefetch_count value, it hardly changes the 
    throughput.
    Also, I tried both persistent and non-persistent messages, and 
    throughput does not change much. It's almost the same as listed above 
    with a node of type DISK.

So from this it seems that the maximum capacity of a single DISK node is 
limited to about 5000 msg/sec when publisher confirms and consumer acks are 
enabled in this version. I thought the main reason for this could be server 
latency. Is this understanding correct, or am I missing something here? 
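
One way to sanity-check the latency idea is a back-of-the-envelope model based on Little's law: if a publisher allows at most N unconfirmed messages in flight and a confirm takes RTT seconds to come back, it cannot confirm more than N / RTT messages per second. The sketch below only does that arithmetic. The window sizes are hypothetical, since the post does not say how many confirms are outstanding at a time, and the function names are made up for illustration.

```python
def max_confirmed_rate(in_flight, rtt_seconds):
    """Upper bound on confirmed msg/sec for a fixed window of unconfirmed messages."""
    if in_flight < 1 or rtt_seconds <= 0:
        raise ValueError("in_flight must be >= 1 and rtt_seconds must be > 0")
    return in_flight / rtt_seconds


def implied_rtt(in_flight, observed_rate):
    """Confirm round-trip time (seconds) that would explain an observed rate."""
    if in_flight < 1 or observed_rate <= 0:
        raise ValueError("in_flight must be >= 1 and observed_rate must be > 0")
    return in_flight / observed_rate


# Hypothetical windows: what RTT would cap a publisher at 5000 msg/sec?
for window in (1, 10, 100):
    rtt_ms = implied_rtt(window, 5000) * 1000
    print(f"window={window}: implied RTT = {rtt_ms:.1f} ms")
```

If the implied RTT for the window actually in use is far below what the network and broker can plausibly deliver, latency is unlikely to be the bottleneck, and the publisher's own CPU (already near 100% of one core here) is the more likely limit.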

And my specific questions on these observations are as follows.
1) Is this the expected throughput scaling behaviour when the number of 
producers and consumers increases linearly? 
2) Can something be done to improve throughput with a single-node 
configuration without changing the publisher confirm and consumer ack 
configuration (that is, keeping them enabled)? 
3) How can server latency be calculated approximately? I thought that by 
adding the round-trip time (RTT) for both publisher confirm and consumer 
ack, one can get the latency. Is this understanding correct? What is an 
effective method to calculate RTT? 
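
On question 3, one possible approach for the publisher side is to timestamp each message when it is published and compute the difference when its confirm arrives. The sketch below is a standalone bookkeeping class, not pika code: the class and method names are hypothetical, and wiring on_publish and on_ack into the client's publish call and confirm callback is left out because the test scripts are not shown. It relies on the AMQP confirm behaviour that an ack with multiple=True covers every delivery tag up to and including the given one.

```python
import time


class ConfirmLatencyTracker:
    """Records publish times by delivery tag and turns acks into RTT samples."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._sent = {}      # delivery_tag -> publish timestamp
        self.samples = []    # observed round-trip times, in seconds

    def on_publish(self, delivery_tag):
        # Call right after handing the message to the client library.
        self._sent[delivery_tag] = self._clock()

    def on_ack(self, delivery_tag, multiple=False):
        # Call from the confirm callback with the broker's tag and flag.
        now = self._clock()
        if multiple:
            tags = [t for t in self._sent if t <= delivery_tag]
        else:
            tags = [delivery_tag] if delivery_tag in self._sent else []
        for tag in tags:
            self.samples.append(now - self._sent.pop(tag))

    def mean_rtt(self):
        """Mean confirm RTT in seconds, or None if nothing was confirmed yet."""
        return sum(self.samples) / len(self.samples) if self.samples else None
```

This measures only the publish-to-confirm round trip as seen by the publisher. End-to-end latency to the consumer would need a timestamp carried in the message and clocks that are comparable between the two machines.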

Test-2) With cluster configuration: 
First I tried a cluster with 1 DISK node and 1 RAM node.

Here, when my producers and consumers connect to the DISK node, the 
statistics are almost the same as in Test-1. 
I tried with a single producer and single consumer, 2 producers and 2 
consumers, and so on. There is no observable difference in throughput. 

Now when my producers and consumers connect to the RAM node, I see the 
following.
1P and 1C - throughput is around 4500-5000 msg/sec
2P and 2C - throughput is around 9000-10000 msg/sec
Beyond this, if I add producers and consumers, per-pair throughput starts 
to drop a little, with overall throughput at 12000 msg/sec and each 
producer/consumer at 4000 msg/sec.
So again I feel that, after a certain number of producers and consumers, 
server latency does come into the picture even for a RAM node and slowly 
brings the throughput down. 
Instead of a multiple of 5000 msg/sec for every added producer and 
consumer, it becomes roughly 4000, 3500, 3000 msg/sec per 
producer-consumer pair.
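
The drop-off can be expressed as scaling efficiency: observed total throughput divided by what perfectly linear scaling from the single-pair rate would give. The sketch below uses the upper end of the ranges quoted above and assumes the 12000 msg/sec total corresponds to 3 pairs (12000 / 4000 per pair); the function name is made up for illustration.

```python
def scaling_efficiency(pairs, total_rate, single_pair_rate):
    """Observed throughput as a fraction of ideal linear scaling."""
    return total_rate / (pairs * single_pair_rate)


# pairs -> total msg/sec on the RAM node (upper end of the quoted ranges)
observed = {1: 5000, 2: 10000, 3: 12000}

for pairs, total in observed.items():
    eff = scaling_efficiency(pairs, total, observed[1])
    print(f"{pairs} pair(s): {total} msg/sec, {eff:.0%} of linear")

# 1 pair(s): 5000 msg/sec, 100% of linear
# 2 pair(s): 10000 msg/sec, 100% of linear
# 3 pair(s): 12000 msg/sec, 80% of linear
```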

After this I added a third node, a fourth node and so on to the cluster, 
all of them also of type RAM. The maximum throughput I can get is around 
22000-24500 msg/sec. 
Changing prefetch_count or delivery_mode (from persistent to 
non-persistent and vice versa) does not really make any big difference.

So my specific questions on the Test-2 observations are as follows.
1) Why is there no linear increase in throughput with a DISK node, as 
there is with a RAM node?
2) At least for non-persistent messages, I believe DISK and RAM nodes 
should behave similarly, but they do not. What are the main differences in 
the way DISK and RAM nodes handle non-persistent messages?
3) What can be done to improve throughput in both tests?
4) Since I have a VM with 20 cores dedicated to RabbitMQ, how can I load 
the CPU to its limit? With the current tests I can load the CPU to a 
maximum of 800% at the above-mentioned throughput. Currently the limiting 
factor seems to be server latency, so how can I overcome that? 

Best Regards,
Priyanki.   

On Tuesday, June 25, 2013 1:09:28 PM UTC+2, hyperthunk wrote:
>
> Reposting
>
> On 25 Jun 2013, at 09:38, Tim Watson <t... at rabbitmq.com> wrote:
>
> What does your publishing code look like? The figures below are expected 
> in that the consumer can keep pace with the producer - it could hardly be 
> expected to consume faster than messages are arriving in the queue(s). So 
> the slowness is very likely on the producing side.
>
> Are you using persistent messages and either publisher confirms or 
> transactions? If so, how often are you waiting on confirms/commits?
>
> With the official clients we typically see avg rates of 50-60 kHz with 
> non-persistent messages. Persistence slows things down a tad, as do 
> confirms (and more so transactions), but even with persistent messages and 
> confirms, rates >= 5 kHz are expected. It /sounds/ like you might be 
> publishing persistent messages with confirms enabled and waiting for a 
> confirm (ack) from the broker for each message. That involves disk I/O on 
> the server for each message plus network latency, effectively making 
> publishing synchronous (and very slow by comparison).
>
> Cheers,
> Tim
>
> On 25 Jun 2013, at 08:44, Priyanki Vashi <vashi.p... at gmail.com> wrote:
>
> Hi there,
>
> I am doing a performance study of RabbitMQ-3.1.1, and this is my first 
> time doing such a performance study with any messaging broker :)
>
> 1) I have thoroughly gone through 'RabbitMQ in Action' and learnt the 
> important concepts. 
>
> 2) Tried a single-node broker to get a feel for how it works and then 
> set up a four-node cluster (with two disk and two RAM nodes). Also 
> configured an HAProxy TCP load balancer so that I can provide a single 
> port to connect to the cluster.
>
> 3) I am simulating producers and consumers through Python scripts (using 
> Python pika library methods to connect to the server, publish, subscribe, 
> etc.)
>
> 4) My scripts are working fine, but where I am stuck is that no matter 
> what I do my throughput is always 300 msg/sec. 
>
> 5) I have defined durable exchanges and queues.
>
> My final requirement is to run at least 10 to 15 producers and 60 to 70 
> consumers simultaneously. I want to start with a linear increase in the 
> number of producers and consumers so that I can draw conclusions about 
> throughput, fault handling, processor utilization, etc., but I am 
> seriously stuck at the initial steps. This group's help would be really 
> appreciated. 
>
> I have started with the following scenarios, but no matter what I do my 
> throughput stays more or less the same (300 msg/sec), except for 
> Scenario-1.
>
> Scenario-1
> - 1 producer, no consumer, and no queue bound to the exchange
> - Producer runs in an infinite loop, publishing to one fanout exchange
> - Publisher confirms disabled
> - Publish rate: 6200 msg/sec (checked through the web management plugin)
>
> Tried Scenario-1 also with a fanout exchange and it's the same 
> publish rate.
> I know that Scenario-1 is not really useful, since there are no queues and 
> the messages will ultimately be dropped, but I tried this as part of the 
> debugging process and I see the above-mentioned results.
>
> Scenario-2
> - 1 producer and 1 consumer 
> - Producer runs in an infinite loop, publishing to one direct exchange
> - The consumer has its own dedicated queue bound to the above exchange 
> - Publisher confirms and consumer acks are disabled
> Throughput: 300 msg/sec (that is, publish rate = 300 msg/sec and 
> deliver rate = 300 msg/sec)
>
> Tried Scenario-2 also with a fanout exchange and with publisher confirms 
> and consumer acks enabled.
> Still the same throughput of 300 msg/sec.
>
> Scenario-3
> - 1 producer and 4 consumers 
> - Producer runs in an infinite loop, publishing to four direct 
> exchanges
> - Each consumer has its own dedicated queue bound to its respective 
> exchange 
> - Publisher confirms and consumer acks are disabled
> Throughput: 300 msg/sec (that is, publish rate = 300 msg/sec and 
> deliver rate = 300 msg/sec)
>
> Tried Scenario-3 also with fanout exchanges and with publisher confirms 
> and consumer acks enabled.
> Still the same throughput of 300 msg/sec.
>
> Also tried setting prefetch_count to 100, but it still gives me the same 
> throughput of 300 msg/sec.
> I am honestly going crazy with this.
>
> After seeing this behavior, I strongly suspect that there is some 
> serious limitation in my simulated producers and consumers.
> Has anyone else tried the Python pika client, and does anyone have an idea 
> of the throughput to expect with this version of rabbit? 
> Does anyone have a rough idea of throughput with RabbitMQ-3.1.1?
>
> I can also share my Python scripts if required, but I would really 
> appreciate some light on this situation.
> Also, what points should I take care of in order to improve throughput?
>
> Best Regards,
> Priyanki
>