Hello all,<div><br></div><div>We recently moved from RabbitMQ 1.8 to 2.3.1 in production. Since then, after a few days of operation the brokers get into a mode where they use so much CPU that it becomes impossible even to SSH into the machines. This seems to be triggered when many queues get deleted. Inspecting the broker (OS) processes with etop shows a lot of reductions in (Erlang) processes that are currently executing prim_file:drv_get_response. This is corroborated by the collectd daemon showing > 12,000 reads/sec and > 2,000 writes/sec! These disk accesses don't end up hitting the actual disk (they are served by the OS cache), hence the high throughput.</div>
<div><br></div><div>It always seems to be the same processes causing this (around 20). Looking at the state of one of them using sys:get_status returns [1]. It looks like a queue process. There are about 9K queues on the brokers when this happens and ~60K messages. Until this happens, the brokers hover around 40% CPU usage, peaking at 60% when the message store is combining the messages (about every 5 minutes with our throughput).</div>
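<div><br></div><div>For anyone wanting to look for the same pattern without etop, the following is a rough sketch, pasted into an Erlang shell attached to the broker node. The names TopByReductions and InFileDriver are made up for illustration. Note that process_info reports cumulative reductions since the process started, whereas etop shows reductions per refresh interval, so the ordering may differ somewhat from what etop displays.</div>

```erlang
%% Sketch only: list the N processes with the most (cumulative) reductions
%% as {Reductions, Pid, CurrentFunction}, highest first.
%% Processes that exit while we iterate return 'undefined' and are skipped
%% by the pattern in the second generator.
TopByReductions = fun(N) ->
    Infos = [{Reds, Pid, MFA}
             || Pid <- erlang:processes(),
                [{reductions, Reds}, {current_function, MFA}]
                    <- [erlang:process_info(Pid, [reductions, current_function])]],
    lists:sublist(lists:reverse(lists:sort(Infos)), N)
end.

%% Of those top N, keep only the processes currently executing
%% prim_file:drv_get_response (any arity).
InFileDriver = fun(N) ->
    [E || {_, _, {prim_file, drv_get_response, _}} = E <- TopByReductions(N)]
end.
```

<div>Calling InFileDriver(20) a few times in a row and comparing the pids gives a quick check of whether it really is the same set of processes each time; sys:get_status can then be run on any pid that keeps showing up.</div>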
<div><br></div><div>It's interesting that this seems to be caused by a small number of queue processes, always the same ones that come up in etop. Also, the CPU graph is a direct reflection of the disk ops graph.</div><div>
<br></div><div>So what are the scenarios that will cause a queue process to access the disk in what looks like an infinite loop? Anything else that could be causing this behavior?</div><div><br></div><div>Thanks!</div><div>
<br></div><div>--</div><div>Raphael.</div><div><br></div><div>[1]</div><div>{status,<0.17050.0>,</div><div><div> {module,gen_server2},</div><div> [[{{#Ref<0.0.1.210978>,fhc_handle},</div><div> {handle,{file_descriptor,prim_file,{#Port<0.29642>,8437}},</div>
<div> 0,0,false,576,infinity,</div><div> [[<<192,0,0,0,0,0,64,144>>],</div><div> [<<128,0,0,0,0,0,64,144>>],</div><div> [<<0,0,0,0,0,0,65,166>>,</div>
<div> [<<218,154,254,188,128,203,224,...>>,</div><div> <<0,4,161,101,66,238,...>>]],</div><div> [<<192,0,0,0,0,0,64,143>>],</div>
<div> [<<128,0,0,0,0,0,64,...>>],</div><div> [<<0,0,0,0,0,0,...>>,</div><div> [<<134,78,165,193,...>>,<<0,4,161,...>>]],</div>
<div> [<<192,0,0,0,0,...>>],</div><div> [<<128,0,0,0,...>>],</div><div> [<<0,0,0,...>>,[<<"u<"...>>,<<...>>]],</div>
<div> [<<192,0,...>>],</div><div> [<<128,...>>],</div><div> [<<...>>|...],</div><div> [...]|...],</div><div> true,</div>
<div> "/var/lib/rabbitmq/mnesia/rabbit@broker1-2/queues/FIN41VKU7DXMV7IZOJED8MUO/journal.jif",</div><div> [write,binary,raw,read],</div><div> [{write_buffer,infinity}],</div>
<div> true,true,</div><div> {1303,269920,172427}}},</div><div> {{#Ref<0.0.474.103189>,fhc_handle},</div><div> {handle,{file_descriptor,prim_file,{#Port<0.116464>,596}},</div>
<div> 15487841,0,false,0,1048576,[],false,</div><div> "/var/lib/rabbitmq/mnesia/rabbit@broker1-2/msg_store_persistent/0.rdq",</div><div> [raw,binary,read],</div>
<div> [{write_buffer,1048576}],</div><div> false,true,</div><div> {1303,269928,232634}}},</div><div> {'$ancestors',[rabbit_amqqueue_sup,rabbit_sup,<0.131.0>]},</div>
<div> {{"/var/lib/rabbitmq/mnesia/rabbit@broker1-2/msg_store_persistent/0.rdq",</div><div> fhc_file},</div><div> {file,1,false}},</div><div> {fhc_age_tree,{2,</div><div> {{1303,269920,172427},</div>
<div> #Ref<0.0.1.210978>,nil,</div><div> {{1303,269928,232634},#Ref<0.0.474.103189>,nil,nil}}}},</div><div> {{"/var/lib/rabbitmq/mnesia/rabbit@broker1-2/queues/FIN41VKU7DXMV7IZOJED8MUO/journal.jif",</div>
<div> fhc_file},</div><div> {file,1,true}},</div><div> {'$initial_call',{gen,init_it,6}}],</div><div> running,<0.10398.0>,[],</div><div> [{header,"Status for generic server <0.17050.0>"},</div>
<div> {data,[{"Status",running},</div><div> {"Parent",<0.10398.0>},</div><div> {"Logged events",[]},</div><div> {"Queued messages",[]}]},</div>
<div> {data,[{"State",</div><div> {q,{amqqueue,{resource,<<"/right_net">>,queue,</div><div> <<"nanite-rs-instan"...>>},</div>
<div> true,false,none,</div><div> [{<<"x-messag"...>>,signedint,86400000}],</div><div> <0.17050.0>},</div>
<div> none,false,rabbit_variable_queue,</div><div> {vqstate,{[],[]},</div><div> {0,{[],...}},</div><div> {delta,undefined,...},</div>
<div> {277,...},</div><div> {...},...},</div><div> {[],[]},</div><div> {[],[]},</div><div> undefined,</div>
<div> {1303269928237652,#Ref<0.0.1710.21035>},</div><div> {1303269928766020,...},</div><div> undefined,...}}]}]]}</div></div>