I mentioned on Twitter some performance issues I experienced over the weekend with STOMP, and @monadic asked me to send an email, so here it is.

For a recent internal hackathon project, I hooked some of our logs to RabbitMQ so a team of engineers could process events in real time. Having done this in the past, I wrote a simple Perl program to tail the logfile in question and put each message on the queue (e.g. /queue/logs). The messages are about 1100 bytes on average, including the usual Apache-style log stuff and a serialized object, MIME-encoded. I was capturing logs from 8 machines that are load balanced, so the rate of messages was pretty even.
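For context, the producer was basically a tail-and-publish loop. The sketch below is a reconstruction rather than the actual script: it assumes the File::Tail and Net::Stomp CPAN modules, and the hostname and credentials are placeholders; only the /queue/logs destination comes from the setup described above.

    #!/usr/bin/env perl
    # Reconstruction of the log-tailing STOMP producer (not the original code).
    # Assumes File::Tail and Net::Stomp from CPAN; host/credentials are placeholders.
    use strict;
    use warnings;
    use File::Tail;
    use Net::Stomp;

    my $logfile = shift @ARGV or die "usage: $0 <logfile>\n";

    # Connect to the broker's STOMP listener.
    my $stomp = Net::Stomp->new({ hostname => 'rabbitmq.example.com', port => 61613 });
    $stomp->connect({ login => 'guest', passcode => 'guest' });

    # Follow the log and publish each new line as one message.
    my $tail = File::Tail->new(name => $logfile, maxinterval => 1);
    while (defined(my $line = $tail->read)) {
        $stomp->send({ destination => '/queue/logs', body => $line });
    }

    $stomp->disconnect;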
Everything seemed to be working fine when I set it up. Messages were going in and coming out just fine at 5000-6000 messages/s (both sides STOMP).

Once the engineer I was helping started running his stuff against RMQ, we started to notice anywhere from 30-600 seconds of lag between the logs and his client (a dead-simple Ruby app using net-stomp). To make things extra fun, our traffic rose and the message rate rose with it, up to ~14k messages/s. We instrumented my tailing code and his Ruby code and couldn't find any issue with either.

I started watching the RabbitMQ instance, which is running in EC2 on an m2.xlarge (in retrospect, not the best instance choice). I could see both vCPUs were pretty busy, hovering at 6-10% idle with 12-20% system. I pulled up top with thread view enabled and could see two threads pegging the CPU. I assumed, but did not verify, that these were running the STOMP code. While all of this was happening, we were watching message rates in the admin web UI. When I compared the numbers in the UI to what my producers were sending, there was a large mismatch that correlated with the delay we were seeing at the consumers. Memory usage of Erlang on the RMQ host was growing well past the 40% mark (the default memory high watermark), so we bumped the watermark to 75%, which simply bought more time before it blew up.
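For reference, the knob we bumped is RabbitMQ's vm_memory_high_watermark, which defaults to 0.4 of system RAM. In the classic Erlang-term rabbitmq.config syntax the change looks roughly like this (raising it only delays the memory alarm; it doesn't fix whatever is backing up):

    %% rabbitmq.config
    [
      {rabbit, [
        {vm_memory_high_watermark, 0.75}   %% default is 0.4
      ]}
    ].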
My suspicion is that the STOMP plugin is getting backed up, based on these observations:

 * memory usage regularly maxed out under load (4,000-14,000 messages/second)
 * the AMQP queue stats did not match what the producers sent
 * the amount of memory consumed was way out of proportion to the AMQP queue depths (usually close to 0!)
 * we were definitely consuming fast enough, with 3-8 consumer processes on dedicated machines (4x m2.xlarge)
 * these machines/processes were showing no stress
 * after shutting down producers, it looked as if they were still producing for up to 10 minutes after shutdown

While under the gun, we tried a few quick & dirty hacks:
 * dropped every other log line in the producer to cut msg/s in half (see the sketch after this list)
   * slowed the performance decay but did not fix anything
 * restarted the producer regularly to cycle connections
   * made things worse - we could observe many draining producer channels in the admin UI that hung around for more than 10 minutes
   * after a while we bounced RabbitMQ so we could move on
   * thrashing seemed to make the existing producer channels drain even slower
 * started more consumers - no change
 * shut down producers
   * only when I took it down to 1 producer did memory usage stop climbing, so ~1200 messages/s is the observed limit
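The every-other-line hack from the first bullet above was just a change to the producer loop, something like this (again a reconstruction against the earlier sketch, not the actual code):

    # Blunt sampling hack: publish only every other log line to halve the rate.
    my $n = 0;
    while (defined(my $line = $tail->read)) {
        next if $n++ % 2;    # drop odd-numbered lines
        $stomp->send({ destination => '/queue/logs', body => $line });
    }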
Things I'd try if this project were still running, but have not:

 * upgrade to Erlang R14
 * switch to AMQP producers (rough sketch below)
 * more/faster CPU instances

In any case, the experiment driving all of this is concluded for now. I can still fire up the producers and dummy consumers for quick tests, but I don't have a lot of time to dedicate to debugging this. For what it's worth, the hackathon project was super cool and successful; I just had to babysit the queue and fire up the producers just before the demo started so the delay would be acceptable ;)
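For completeness, here's roughly what the AMQP-producer variant mentioned above could look like. This is a hypothetical sketch using the Net::RabbitMQ CPAN module, not code we actually ran; it publishes through the default exchange to a queue named "logs", which should be what the STOMP plugin maps /queue/logs to, and assumes the queue already exists.

    #!/usr/bin/env perl
    # Hypothetical AMQP version of the producer (never actually tried).
    # Assumes File::Tail and Net::RabbitMQ from CPAN; host/credentials are placeholders.
    use strict;
    use warnings;
    use File::Tail;
    use Net::RabbitMQ;

    my $logfile = shift @ARGV or die "usage: $0 <logfile>\n";

    my $mq = Net::RabbitMQ->new();
    $mq->connect('rabbitmq.example.com', { user => 'guest', password => 'guest' });
    $mq->channel_open(1);

    my $tail = File::Tail->new(name => $logfile, maxinterval => 1);
    while (defined(my $line = $tail->read)) {
        # Publish via the default exchange; the routing key is the queue name.
        $mq->publish(1, 'logs', $line, { exchange => '' });
    }

    $mq->disconnect();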
-Al