[rabbitmq-discuss] RabbitMQ 2.3.1 high CPU usage & disks operations
Raphael Simon
raphael at rightscale.com
Thu Apr 21 00:58:14 BST 2011
Hello all,
We recently moved from RabbitMQ 1.8 to 2.3.1 in production, and since then we
have noticed that after a few days of operation the brokers get into a mode
where they use so much CPU that it becomes impossible to even SSH into the
machines. This seems to get triggered when many queues are deleted. Inspecting
the broker (OS) processes with etop shows a lot of reductions in (Erlang)
processes that are currently executing prim_file:drv_get_response. This is
corroborated by the collectd daemon showing > 12,000 reads/sec and > 2,000
writes/sec! These disk accesses don't end up hitting the actual disk (they are
served by the OS cache), hence the high throughput.
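For reference, this is roughly how we watch the reductions (a sketch only; the
node name is ours and the exact etop options are from memory). From a separate
Erlang node that shares the broker's cookie we run:

  etop:start([{node, 'rabbit@broker1-2'},   %% the broker node
              {sort, reductions},           %% sort processes by reductions
              {interval, 2}]).              %% refresh every 2 seconds

and that is where the processes sitting in prim_file:drv_get_response show up.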
It always seems to be the same processes causing this (around 20). Looking at
the state of one of them with sys:get_status returns [1]; it looks like a
queue process. There are about 9K queues on the brokers when this happens, and
~60K messages. Until this happens the brokers hover around 40% CPU usage,
peaking at 60% when the message store is combining messages (about every
5 minutes at our throughput).
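In case it's useful, this is roughly how we poke at one of the suspect pids
from a remote shell on the broker node (a sketch only; the pid below is just
the one from [1]):

  P = pid(0, 17050, 0).
  sys:get_status(P).    %% this is what produced the output in [1]
  %% sample it a few times to confirm reductions keep climbing while it
  %% sits in file I/O
  [begin timer:sleep(1000),
         erlang:process_info(P, [current_function, reductions])
   end || _ <- lists:seq(1, 5)].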
It's interesting that this seems to be caused by a small number of queue
processes, always the same ones showing up in etop. Also, the CPU graph is a
direct reflection of the disk-ops graph.
So what scenarios would cause a queue process to access the disk in what looks
like an infinite loop? Is there anything else that could be causing this
behavior?
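For what it's worth, this is the kind of short trace we were thinking of
running next to see exactly which prim_file calls (and hence which files) one
of these processes keeps making; a sketch only, and something we'd only leave
on for a few seconds in production:

  dbg:tracer(),
  dbg:p(pid(0, 17050, 0), [call]),   %% trace calls made by one suspect pid
  dbg:tpl(prim_file, []),            %% match every prim_file function
  timer:sleep(5000),                 %% gather ~5 seconds of calls
  dbg:stop_clear().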
Thanks!
--
Raphael.
[1]
{status,<0.17050.0>,
{module,gen_server2},
[[{{#Ref<0.0.1.210978>,fhc_handle},
{handle,{file_descriptor,prim_file,{#Port<0.29642>,8437}},
0,0,false,576,infinity,
[[<<192,0,0,0,0,0,64,144>>],
[<<128,0,0,0,0,0,64,144>>],
[<<0,0,0,0,0,0,65,166>>,
[<<218,154,254,188,128,203,224,...>>,
<<0,4,161,101,66,238,...>>]],
[<<192,0,0,0,0,0,64,143>>],
[<<128,0,0,0,0,0,64,...>>],
[<<0,0,0,0,0,0,...>>,
[<<134,78,165,193,...>>,<<0,4,161,...>>]],
[<<192,0,0,0,0,...>>],
[<<128,0,0,0,...>>],
[<<0,0,0,...>>,[<<"u<"...>>,<<...>>]],
[<<192,0,...>>],
[<<128,...>>],
[<<...>>|...],
[...]|...],
true,
"/var/lib/rabbitmq/mnesia/rabbit at broker1-2
/queues/FIN41VKU7DXMV7IZOJED8MUO/journal.jif",
[write,binary,raw,read],
[{write_buffer,infinity}],
true,true,
{1303,269920,172427}}},
{{#Ref<0.0.474.103189>,fhc_handle},
{handle,{file_descriptor,prim_file,{#Port<0.116464>,596}},
15487841,0,false,0,1048576,[],false,
"/var/lib/rabbitmq/mnesia/rabbit at broker1-2
/msg_store_persistent/0.rdq",
[raw,binary,read],
[{write_buffer,1048576}],
false,true,
{1303,269928,232634}}},
{'$ancestors',[rabbit_amqqueue_sup,rabbit_sup,<0.131.0>]},
{{"/var/lib/rabbitmq/mnesia/rabbit at broker1-2
/msg_store_persistent/0.rdq",
fhc_file},
{file,1,false}},
{fhc_age_tree,{2,
{{1303,269920,172427},
#Ref<0.0.1.210978>,nil,
{{1303,269928,232634},#Ref<0.0.474.103189>,nil,nil}}}},
{{"/var/lib/rabbitmq/mnesia/rabbit at broker1-2
/queues/FIN41VKU7DXMV7IZOJED8MUO/journal.jif",
fhc_file},
{file,1,true}},
{'$initial_call',{gen,init_it,6}}],
running,<0.10398.0>,[],
[{header,"Status for generic server <0.17050.0>"},
{data,[{"Status",running},
{"Parent",<0.10398.0>},
{"Logged events",[]},
{"Queued messages",[]}]},
{data,[{"State",
{q,{amqqueue,{resource,<<"/right_net">>,queue,
<<"nanite-rs-instan"...>>},
true,false,none,
[{<<"x-messag"...>>,signedint,86400000}],
<0.17050.0>},
none,false,rabbit_variable_queue,
{vqstate,{[],[]},
{0,{[],...}},
{delta,undefined,...},
{277,...},
{...},...},
{[],[]},
{[],[]},
undefined,
{1303269928237652,#Ref<0.0.1710.21035>},
{1303269928766020,...},
undefined,...}}]}]]}