Jesper,

This is totally awesome, thanks for all your efforts! Quite a few of these items we already have open bugs for, but we've not prioritised optimising the Erlang client up until now. I will, however, be picking up these patches and looking at incorporating them. Is it ok if I contact you offline if and when I need to discuss the changes?

Cheers,
Tim

On 3 Sep 2013, at 13:35, Jesper Louis Andersen wrote:

# Optimizing the Erlang RabbitMQ Client

I originally thought about this as a blog post, but as I don't really have time to finish it up and polish it, I'll just write it here on the mailing list instead. There are some patches in this mail you may want to grab and apply to the RabbitMQ Erlang client. In my benchmarks, they helped me squeeze some more performance out of my systems, in the ballpark of 15-40% more messages per second.

My benchmark is quite synthetic and bad, since it goes to great lengths to play on all of AMQP's weaknesses. But the patches are probably quite sound and pick out the low-hanging fruit that is in there. At the end, I have written down some ideas for further improvement of the client, but I did not have the time to pursue them; currently, the client is fast enough for our purposes, our load being more than an order of magnitude below the limits it imposes.

In any case, you may want to grab the patches I made, since they would probably help any Erlang RabbitMQ client user. This message came about because I complained to Alvaro `gifsockets` Videla about the Erlang client, and he urged me to write up my findings. So here goes nothing:

## The setup

The current setup is like this:

* Run a RabbitMQ server, version 3.1.5.
* Run a fairly recent RabbitMQ Erlang client. We use version 3.1.1, but there are so few changes to it that it doesn't matter that it is not the newest version.
* We run everything on localhost.
* We run with Erlang/OTP R16B01.
* We make two connections to RabbitMQ.
* One connection manages a "sender" channel, run by a single "sender" process.
* The other connection runs 10 channels, each channel having a "handler" function.

The setup is somewhat suboptimal. The same machine runs everything, so certain nice optimizations, like setting scheduler bind types and so on, won't really help that much. Furthermore, we don't tune memory allocators, which I have a hunch could help in these situations.

A test is the following:

* We spawn 100 worker processes. Each sends 10000 messages, waiting for a reply to each.
* The handlers reply via reply_to queues.
* All messages are very small, a few bytes. We want to test messaging capability.
* We thus run one million (10⁶) RPC queries. Each query is two messages: a request and a reply.
* The single "sender" multiplexes and demultiplexes the 100 worker processes. This is a known bottleneck in the code base. A sketch of the worker side of this setup follows below.
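
For concreteness, a minimal sketch of what a single RPC worker could look like against the `amqp_client` API (illustrative only: the module name, queue name and payload are mine, and the real harness multiplexes all workers over the single sender channel):

    %% Minimal sketch of one RPC worker (illustrative, not the real harness).
    -module(rpc_worker_sketch).
    -include_lib("amqp_client/include/amqp_client.hrl").
    -export([run/2]).

    run(Channel, N) ->
        %% Exclusive, server-named queue for the replies.
        #'queue.declare_ok'{queue = ReplyQ} =
            amqp_channel:call(Channel, #'queue.declare'{exclusive = true}),
        amqp_channel:subscribe(Channel,
                               #'basic.consume'{queue = ReplyQ, no_ack = true},
                               self()),
        receive #'basic.consume_ok'{} -> ok end,
        [rpc_round(Channel, ReplyQ, I) || I <- lists:seq(1, N)],
        ok.

    rpc_round(Channel, ReplyQ, I) ->
        CorrId = integer_to_binary(I),
        Props = #'P_basic'{reply_to = ReplyQ, correlation_id = CorrId},
        Publish = #'basic.publish'{exchange = <<>>, routing_key = <<"rpc_queue">>},
        amqp_channel:cast(Channel, Publish,
                          #amqp_msg{props = Props, payload = <<"ping">>}),
        %% Wait for the matching reply delivered by the broker.
        receive
            {#'basic.deliver'{},
             #amqp_msg{props = #'P_basic'{correlation_id = CorrId}}} -> ok
        end.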

## Initial state

I have a Core i7-3630QM based laptop. Making sure it is not running on battery, we can process roughly 12.5k RPS (RPCs per second) on this setup. When running, it seems we max out all cores in the machine. Roughly 3 cores go to RabbitMQ, 4 cores go to our test case and the last core does other stuff, like playing Spotify. My benchmark surely runs in a highly controlled environment :)

What is odd, though, is the load of 400% on the client Erlang node. Why on *earth* is the client doing more work than the server?

When debugging stuff like this, the first thing to look at is the bottom of things. We must make sure the kernel and the network are behaving like we expect them to. Old trusty *tcpdump* always comes to the rescue for these kinds of things. We grab a typical run on disk just to see what is happening:

    tcpdump -i lo -w tcp.dump 'port 5672'

Given such a trace, we can analyze it in tcptrace and in Wireshark. They can tell us if there is anything odd to look for in the handling of the low-level communication. Here is a simple breakdown of the packet sizes:

    Size     Count      Percent
    40-79    417900     12.38%
    [⋯]
    160-329  2814258    83.37%

This is *very* odd. A localhost interface has an MTU of 64 kilobytes, yet most packets are extremely small. RabbitMQ disables Nagle's algorithm by setting *TCP_NODELAY*, and it does not look like the consequences of that are handled well: most of the packets we transfer are pretty small. This is quite nasty for performance.
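
For reference, disabling Nagle on an Erlang socket is just the `nodelay` option; once it is set, batching writes becomes the application's job (a standalone illustration, not code from the client):

    %% Illustration only: with nodelay set, every write can leave as its own
    %% small segment unless the application buffers writes itself.
    {ok, Sock} = gen_tcp:connect("localhost", 5672,
                                 [binary, {nodelay, true}, {active, false}]),
    ok = gen_tcp:send(Sock, <<"AMQP", 0, 0, 9, 1>>).  %% AMQP 0-9-1 protocol header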

Wireshark tells us we can push around 65.5 megabit through the link. This isn't a lot on a localhost interface. I had a hunch something might be wrong with the TCP window, but it never closes according to Wireshark, so this cannot be the case.

The major problem here is that we are not filling up packets for some reason and are sending way too small IP datagrams for this to ever be fast. We are dying by processing small packets.

## Eprof

Another useful approach is to use the *eprof* profiler. We hook amqp_client and change the main application start callback such that it will profile its own tree:

    start(_StartType, _StartArgs) ->
        eprof:start_profiling([self()]),
        amqp_sup:start_link().

In our test case, we can now stop the profiler and dump its analysis with `eprof:stop_profiling/0`, `eprof:log/1` and `eprof:analyze/0`.
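
In the test code that amounts to roughly the following (a sketch; the log file name is my own choice):

    %% Run once the benchmark has finished.
    eprof:stop_profiling(),
    eprof:log("amqp_client.profile"),  %% send the analysis output to a file
    eprof:analyze().                   %% defaults to a per-process breakdown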

This only profiles the amqp_client tree, not our own test code which we use to load it. Hence, all the time reported is spent on the AMQP side of things. We could also try to optimize the test code, but for now the focus is on the `amqp_client` side of things. We get a lot of interesting output, most notably for one process doing a lot of work:

    ****** Process <0.184.0> -- 18.45 % of profiled time ***
    FUNCTION                          CALLS    %      TIME      [uS / CALLS]
    --------                          -----    ---    ----      [----------]
    [⋯]
    amqp_main_reader:init/1           1        0.00   1         [ 1.00]
    [⋯]
    erlang:port_control/3             6000087  4.06   2124919   [ 0.35]
    prim_inet:ctl_cmd/3               6000087  4.16   2177725   [ 0.36]
    gen_server:dispatch/3             6000086  4.24   2223023   [ 0.37]
    prim_inet:enc_time/1              6000087  5.07   2656325   [ 0.44]
    erlang:send/3                     3000040  6.39   3349936   [ 1.12]
    amqp_main_reader:handle_info/2    6000086  7.10   3717176   [ 0.62]
    erts_internal:port_control/3      6000087  38.26  20042440  [ 3.34]

This process accounts for around 20% of the load and makes roughly 6 million calls to `erts_internal:port_control/3`. This is odd; we would expect far fewer calls. Since both the request and the reply pass through this part of the code base, it seems odd that we are making 6 calls per RPC. In fact, the number shouldn't even be two: we are passing a *lot* of data through the socket, so we expect Erlang to pick up a lot of data and process it with a single call to the underlying socket, not one call per message, which is very expensive.

Looking into the `amqp_main_reader` module, we see that it handles the underlying socket in a very odd way. First, it reads exactly 7 bytes off the socket for the AMQP frame header. From this it obtains the payload length, and then it issues a second read for the payload. This means two trips to the socket per frame, and it is the reason we spend all this time in port control.
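
In pseudo-Erlang, the read path described above behaves roughly like this (a sketch using blocking reads for clarity; the real module drives the socket asynchronously):

    %% Sketch of the per-frame read pattern described above (not the real code).
    read_frame(Sock) ->
        {ok, <<Type:8, Channel:16, Size:32>>} = gen_tcp:recv(Sock, 7),            %% frame header
        {ok, <<Payload:Size/binary, 16#CE>>}  = gen_tcp:recv(Sock, Size + 1),     %% payload + frame-end
        {Type, Channel, Payload}.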

The optimization is straightforward: read as much data as possible from the underlying socket per call, then process as much of it as you can before going to the socket again. There are basically two scenarios. Either the socket is not loaded at all, in which case there is no reason to even think about optimization, or the socket is loaded, in which case we should optimize for the fast path. The fast path here is that all data for a message is already available. If not, we can just append the new data to the old data,

    <<OldData/binary, NewData/binary>>

since in Erlang releases after OTP R12B this will use the extra capacity allocated for binaries and avoid copying in most cases anyhow. Also, when the socket is running slowly, a bit of extra copying like this is not going to be a problem, whereas if the socket processes lots of data, we hit the fast path. In other words, the natural implementation is reasonably fast. Implementing this change immediately pushes us to 31k messages per second, so we are up 6000 messages, or around 3000 RPC calls, per second.

It also pushes the 6 million calls down to 6 * 10⁵, shaving off an order of magnitude. And it shuffles the load in the system to other parts for now. The main problem now is the small packets being pushed all the time, which has to do with the send path in RabbitMQ.

    [https://github.com/issuu/amqp_client/commit/d8b64beb31f0aa63df7624e240bc324a0ed3f8f8.patch]
    [https://github.com/issuu/amqp_client/commit/5f01fd355e82a4e6555e646d4f9fb6a521ef1911.patch]

## Next optimizations

After these optimizations we have the following breakdown:

    amqp_channels_manager     9.91% time

It spends its time on two things: decoding incoming frames, and looking up and updating a `gb_trees` structure that maps frames to their channel.

    amqp_main_reader          8.64% time
    amqp_main_reader          7.43% time

Most of the time is now spent decoding and analyzing frames in the two main reader processes. They use around 20% CPU each, but they are not spending it working on the socket. Rather, they spend the time in the frame decoder loop, which is exactly where we want them to spend it. Optimizing further here is definitely a great idea, but Amdahl tells us there are other culprits where we can gain more by optimizing.

    amqp_channels_manager     8.81% time
    amqp_direct_consumer      4.32% time
    amqp_channel             13.11% time
    rabbit_writer            14.45% time

This is around 66.6% of all profiled time, spent in the selected processes and their underlying modules. The remaining 33% is in the same modules, but in processes that each account for less than 2% of the time. For the second round, the major processes to look into are the channel itself and the rabbit_writer process. Perhaps looking at the latter can tell us why packets are so small, and looking into the former can explain why we spend so much time handling channels.

Let us look at what `rabbit_writer` is doing. It manages the writing of frames to the underlying socket. It can write in two modes, the ubiquitous async and sync variants. The idea is that you can buffer up writes before pushing them onto the socket, which runs in TCP_NODELAY mode. This gives us low-latency operation, but it also requires us to handle buffering at our end; otherwise the packets get small. Who calls rabbit_writer?

    # `gg` is an alias for [git grep -n "$@"]
    ; gg rabbit_writer:
    src/amqp_channel.erl:784: {network, none, _} -> rabbit_writer:send_command_sync(W, Method);
    src/amqp_channel.erl:785: {network, _, _} -> rabbit_writer:send_command_sync(W, Method,
    src/amqp_network_connection.erl:56: catch rabbit_writer:send_command_sync(Writer, Method).

Ouch! All writes are synchronous! This means we are forcing a flush to the underlying socket whenever we write something on it, and we are forcing a context switch from the `amqp_channel` code to the `rabbit_writer` all the time, even though we don't need to. This accounts for the small packets we see when writing: the response is written to the socket and then pushed out immediately because of TCP_NODELAY. It also means each write results in a write to the underlying Erlang `port()`, which is quite expensive.

    [https://github.com/issuu/amqp_client/commit/42856090936e7196eab6c8e71107a479883f97a9.patch]

RabbitMQ itself does not use the sync writes for anything but closing down the line, so we can probably safely switch to async writes on the socket. The change severely cuts down on the number of calls to the `port()`, since writes can now happen in larger batches.
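
The gist of the change, illustrated with the two writer entry points (not the actual diff; `rabbit_writer` exposes both the sync and async variants):

    %% Illustration (not the actual patch): the sync variant forces a flush per
    %% method, the async variant lets rabbit_writer coalesce frames first.
    send_one_sync(Writer, Method)  -> rabbit_writer:send_command_sync(Writer, Method).
    send_one_async(Writer, Method) -> rabbit_writer:send_command(Writer, Method).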

The change does not help much on the packet sizes, however. Most packets are still small, due to TCP_NODELAY and due to how the processing happens on the interface. The writer's main loop looks like this:

    mainloop1(State) ->
        receive
            Message -> ?MODULE:mainloop1(handle_message(Message, State))
        after 0 ->
            ?MODULE:mainloop1(internal_flush(State))
        end.

So if the mailbox is usually small, we will often push out small packets. Looking at a running system through the `observer` application shows that the main users of reductions in the system are currently `amqp_channels_mgr` and the muxing client we use for the test.

## Two major problems

The main problems at this point are twofold: we spend a lot of time decoding and encoding packets, and we spend an extremely high amount of time finding the correct channel endpoint for a message. The reason for the latter is simply that every time we find an appropriate channel process, we also update the appropriate assembler state. And we run the decoder loop in the channels manager.

The assembler state is a protocol verifier. It tracks the protocol version and checks that the next frame matches the expected frame. This is of course a nice and necessary thing to do. But keeping all channels in the same process and managing the decoder loop like this is odd. We get no parallel execution on the connection, and we have to update a state map for the assembler state all the time. Furthermore, by decoding first and then passing data on, we have to send more complex messages rather than simply a binary reference.

I would probably have chosen a design based on an ETS lookup table, and then kept the assembler state in the channel process itself to maximize parallelism. In turn, we could avoid spending a lot of time updating internal memory state in a process on the critical path, which would speed up the system by quite a lot. Also, we are currently paying reductions on busy channels, so this would help fairness too.

Also, this means we could do away with the channels_manager process entirely. There is no need for it, since a decoded message can just be forwarded to the channel directly, skipping an unneeded message pass on the critical decoder path.
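
A minimal sketch of what such an ETS-based dispatch could look like (my own illustration of the design idea, not existing client code; the table and function names are hypothetical):

    %% Hypothetical channel registry: map channel numbers to channel pids in ETS
    %% so the reader can dispatch raw frames without a middleman process.
    init_registry() ->
        ets:new(amqp_channel_registry,
                [named_table, public, set, {read_concurrency, true}]).

    register_channel(Number, Pid) ->
        true = ets:insert(amqp_channel_registry, {Number, Pid}).

    %% Called from the reader loop: forward the still-encoded frame directly,
    %% leaving decoding and assembler-state updates to the channel process.
    dispatch_frame(Number, FrameBin) ->
        case ets:lookup(amqp_channel_registry, Number) of
            [{Number, Pid}] -> Pid ! {frame, FrameBin}, ok;
            []              -> {error, unknown_channel}
        end.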

## More controversial patches to `rabbit_writer`

The writer process inside RabbitMQ is something I wanted to optimize as well, so I tried changing two things in it. First, I avoided recomputing `iolist_size/1` on every received message. Second, I dropped the small 1414-byte flush threshold.

The rationale is this: either we are running at a high pace in the writer code, in which case we easily exceed 1414 bytes anyway, and bumping the threshold to 64 kilobytes and making fewer calls to the underlying port seems to be the right thing to do; or we are running slowly, in which case we will eventually empty the mailbox of the writer process and flush whatever is in our buffer at that point. The controversial part is that if the writer were kept buffering all the time, the 64-kilobyte window is quite large and may take time to fill. But I guess this won't happen, since writing like this will be awfully fast and it is unlikely that you will be hit by exactly that kind of slow-but-steady trickle of messages in practice.

    [https://github.com/issuu/rabbit_common/commit/62104657aaf2189e35eb28f93fa2033b28973dde.patch]
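
For illustration, the buffering policy described above boils down to something like the following sketch (record, macro and function names are mine, not `rabbit_writer`'s; the byte count is tracked incrementally instead of calling `iolist_size/1` on the whole buffer for every message):

    %% Sketch of the flush policy (names are mine, not rabbit_writer's actual state).
    -record(wstate, {sock, pending = [], pending_size = 0}).
    -define(FLUSH_THRESHOLD, 65536).   %% bytes; replaces the old 1414-byte threshold

    buffer(Frame, State = #wstate{pending = P, pending_size = Size}) ->
        %% Track the byte count incrementally as frames arrive.
        maybe_flush(State#wstate{pending = [Frame | P],
                                 pending_size = Size + iolist_size(Frame)}).

    maybe_flush(State = #wstate{sock = Sock, pending = P, pending_size = Size})
      when Size >= ?FLUSH_THRESHOLD ->
        ok = gen_tcp:send(Sock, lists:reverse(P)),   %% this reverse is the next patch's target
        State#wstate{pending = [], pending_size = 0};
    maybe_flush(State) ->
        State.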

While here, I don't know why we are producing an iolist() in the reverse order of what we need. Rather, we can just build up an iolist() tree and then send it out over the line. This avoids a reverse on the critical path, and there is no reason to do one:

    [https://github.com/issuu/rabbit_common/commit/15c270fcb75ab758d312cf0e4d1d5864127f5244.patch]
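
Sketched against the hypothetical buffer from the previous example: instead of prepending frames and reversing at flush time, nest the existing buffer and the new frame as an iolist tree, which the port accepts directly:

    %% Tree-building alternative (the direction the patch takes): [P, Frame] is
    %% already in send order, so flushing becomes gen_tcp:send(Sock, P) with no
    %% lists:reverse/1 on the critical path.
    buffer_tree(Frame, State = #wstate{pending = P, pending_size = Size}) ->
        maybe_flush(State#wstate{pending = [P, Frame],
                                 pending_size = Size + iolist_size(Frame)}).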

This patch is also slightly controversial, since it may not be what you guys want to do. Furthermore, it did not increase the send and receive rates in a significant way in my tests.

-- 
J.