[rabbitmq-discuss] "Dead" beam.smp threads under high load
Paul Bowsher
paul.bowsher at gmail.com
Tue Oct 8 11:21:47 BST 2013
Hi,
Under fairly high loads (upwards of 2,000 messages/s), RabbitMQ needs more
than one of our CPU cores, which is fine. However, after a long
(undetermined) period of Erlang VM uptime, the scheduler threads seem to
fall idle. Looking at top/htop with thread views enabled, only one thread
(always the same one) is busy, constantly pegged at 99% of a core; the
other threads barely reach 0.1%. We run the Erlang VM with default flags,
i.e. no +S or +s options. Some information about the schedulers, CPU
bindings, etc.:
# rabbitmqctl eval 'erlang:system_info(schedulers_online).'
24
# rabbitmqctl eval 'erlang:system_info(schedulers).'
24
# rabbitmqctl eval 'erlang:system_info(cpu_topology).'
[{node,[{processor,[{core,[{thread,{logical,1}},{thread,{logical,13}}]},
{core,[{thread,{logical,3}},{thread,{logical,15}}]},
{core,[{thread,{logical,5}},{thread,{logical,17}}]},
{core,[{thread,{logical,7}},{thread,{logical,19}}]},
{core,[{thread,{logical,9}},{thread,{logical,21}}]},
{core,[{thread,{logical,11}},{thread,{logical,23}}]}]}]},
{node,[{processor,[{core,[{thread,{logical,0}},{thread,{logical,12}}]},
{core,[{thread,{logical,2}},{thread,{logical,14}}]},
{core,[{thread,{logical,4}},{thread,{logical,16}}]},
{core,[{thread,{logical,6}},{thread,{logical,18}}]},
{core,[{thread,{logical,8}},{thread,{logical,20}}]},
{core,[{thread,{logical,10}},{thread,{logical,22}}]}]}]}]
# rabbitmqctl eval 'erlang:system_info(logical_processors_online).'
24
# rabbitmqctl eval 'erlang:system_info(multi_scheduling).'
enabled
# rabbitmqctl eval 'erlang:system_info(scheduler_bindings).'
{unbound,unbound,unbound,unbound,unbound,unbound,unbound,unbound,unbound,
unbound,unbound,unbound,unbound,unbound,unbound,unbound,unbound,
unbound,unbound,unbound,unbound,unbound,unbound,unbound}
# rabbitmqctl eval 'erlang:system_info(threads).'
true
# rabbitmqctl eval 'erlang:system_info(thread_pool_size).'
30
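If per-scheduler utilisation figures would help, we can also sample them
via the VM's scheduler_wall_time counters, e.g. a one-second sample (the
flag only needs enabling once; the quoting may need tweaking per shell):
# rabbitmqctl eval 'erlang:system_flag(scheduler_wall_time, true).'
# rabbitmqctl eval 'Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(1000),
    Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    [{I, (A1 - A0) / (T1 - T0)} ||
        {{I, A0, T0}, {I, A1, T1}} <- lists:zip(Ts0, Ts1)].'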
beam command line:
/usr/lib64/erlang/erts-5.10.1/bin/beam.smp -W w -K true -A30 -P 1048576 --
-root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.1.0/sbin/../ebin -noshell -noinput
-s rabbit boot -sname rabbit@rabbit-node-name -boot start_sasl -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit@rabbit-node-name.log"}
-rabbit sasl_error_logger
{file,"/var/log/rabbitmq/rabbit@rabbit-node-name-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.1.0/sbin/../plugins" -rabbit
plugins_expand_dir
"/var/lib/rabbitmq/mnesia/rabbit@rabbit-node-name-plugins-expand" -os_mon
start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false
-mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@rabbit-node-name"
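We haven't tried any non-default emulator flags yet. If it would help to
experiment (for instance with scheduler binding, given the unbound output
above), our understanding is that flags can be set via SERVER_ERL_ARGS in
/etc/rabbitmq/rabbitmq-env.conf, which replaces the defaults, so the flags
visible in the command line above would need restating. A sketch of what
we'd try; the +sbt db choice is a guess on our part, not something tested:
# /etc/rabbitmq/rabbitmq-env.conf
# restate the default flags seen above, plus default-bind scheduler binding
SERVER_ERL_ARGS="+K true +A30 +P 1048576 -kernel inet_default_connect_options [{nodelay,true}] +sbt db"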
htop screenshot (at not-quite-full load): http://i.imgur.com/fom6Cwa.png
Once a node is in this state, with only one core being utilised under high
load, message throughput hits a ceiling, the run_queue grows, and the
Management API becomes unresponsive (we use it for a lot of monitoring).
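For reference, the run_queue figure is the VM's total run queue length,
which we sample with the standard counter:
# rabbitmqctl eval 'erlang:statistics(run_queue).'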
To rectify the situation we first tried a rabbitmqctl stop_app followed by
rabbitmqctl start_app (our nodes are clustered with mirrored queues), but
this didn't help. In the end we stopped the app and restarted the Erlang VM
as a whole. Immediately afterwards we saw 6-8 threads each using about 70%
CPU, throughput climbed back to where it should be, the run_queue stayed at
0, and the Management API was fully responsive.
We currently have 4 nodes in this "stuck" state serving our less-critical
workloads, so we are able to provide any debugging information required.
We're running 24 "cores" worth of Xeon E5645 on
RHEL 5.6 (2.6.18-238.27.1.el5), with both RabbitMQ 3.1.0 and 3.1.5 on a
self-compiled RPM of Erlang OTP R16B with HiPE disabled.
Thanks in advance for any help, and let me know if we can provide any
further information: straces, netstat output, etc.
Paul Bowsher
Senior Engineer
Global Personals