[rabbitmq-discuss] RabbitMQ 2.3.1 high CPU usage & disks operations
Raphael Simon
raphael at rightscale.com
Mon Apr 25 23:12:07 BST 2011
Following up on this as we are still having the issue, I've pasted the
output of etop for a broker in a 'good state' [1] vs. one in a 'bad state'
[2]. You'll notice that broker1-1 (the one in 'bad state' i.e. using ~120%
cpu compared to ~20% for the one in 'good state') is spending a lot of time
in timers and reading from disk. Looking at the timer_server process on 1-1
shows that it's mostly running rabbit_amqqueue:sync_timeout, this seems to
be the culprit for the high cpu and disk ops. So what is the purpose of this
timer? and why would it suddenly trigger more often or at least be more
expensive to handle?
Thank you!
--
Raphael.
[1] Broker in 'good state'
========================================================================================
'rabbit at broker1-2'
21:57:33
Load: cpu 12 Memory: total 855004 binary
39949
procs 69263 processes 724592 code
11738
runq 0 atom 3149 ets
41428
Pid Name or Initial Func Time Reds Memory MsgQ Current
Function
----------------------------------------------------------------------------------------
<4385.6.0> error_logger '-' 975256 142672 0
gen_event:fetch_msg/
<4385.4844.299>etop_txt:init/1 '-' 959286******** 0
etop:update/1
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101715 3000 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 101700 142744 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 101700 142744 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 101700 142744 0
gen_server2:process_
<4385.25667.0> proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
<4385.11107.1> proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
<4385.19047.1> proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
<4385.3996.13> proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
<4385.14391.47>proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 100267 2936 0
erlang:hibernate/3
<4385.10908.1> proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
<4385.25705.1> proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
<4385.28252.1> proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
<4385.32119.1> proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
<4385.24583.61>proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 100252 109296 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 92046 142744 0
ets:select/1
<4385.186.0> rabbit_memory_monito '-' 41961 779760 0
gen_server2:process_
<4385.71.0> timer_server '-' 13240 2914432 0
gen_server:loop/6
<4385.25604.0> proc_lib:init_p/5 '-' 11288 1116400 0
gen_server2:process_
***************erlang:apply/2 '-' 935912342344 0
file_io_server:serve
<4385.191.0> msg_store_persistent '-' 8946 2914424 0
gen_server2:process_
<4385.4890.299>proc_lib:init_p/5 '-' 4164 13832 0
gen_server2:process_
<4385.0.0> init '-' 2147 21488 0
init:loop/1
<4385.4800.299>proc_lib:init_p/5 '-' 1892 34616 0
gen_server2:process_
<4385.4886.299>proc_lib:init_p/5 '-' 1774 24784 0
rabbit_reader:mainlo
<4385.3005.28> proc_lib:init_p/5 '-' 637 14120 0
gen_server2:process_
<4385.1688.28> proc_lib:init_p/5 '-' 609 11176 0
gen:do_call/4
<4385.2072.28> proc_lib:init_p/5 '-' 457 3200 0
erlang:hibernate/3
<4385.2163.28> proc_lib:init_p/5 '-' 449 11104 0
gen_server2:process_
<4385.957.28> proc_lib:init_p/5 '-' 438 26896 0
gen_server2:process_
<4385.29794.27>proc_lib:init_p/5 '-' 423 10848 1
gen_server2:process_
<4385.31982.27>proc_lib:init_p/5 '-' 421 14120 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 414 2984 0
erlang:hibernate/3
<4385.25130.0> proc_lib:init_p/5 '-' 414 2984 0
erlang:hibernate/3
<4385.25554.0> proc_lib:init_p/5 '-' 414 2984 0
erlang:hibernate/3
<4385.25800.0> proc_lib:init_p/5 '-' 414 2936 0
erlang:hibernate/3
<4385.25976.0> proc_lib:init_p/5 '-' 414 2936 0
erlang:hibernate/3
<4385.26169.0> proc_lib:init_p/5 '-' 414 2936 0
erlang:hibernate/3
<4385.26369.0> proc_lib:init_p/5 '-' 414 2936 0
erlang:hibernate/3
<4385.26677.0> proc_lib:init_p/5 '-' 414 3000 0
erlang:hibernate/3
========================================================================================
[2] Broker in 'bad state'
========================================================================================
'rabbit at broker1-1'
21:57:34
Load: cpu 102 Memory: total 1047278 binary
82002
procs 73585 processes 870711 code
11738
runq 1 atom 2750 ets
45360
Pid Name or Initial Func Time Reds Memory MsgQ Current
Function
----------------------------------------------------------------------------------------
<4385.77.0> timer_server '-' 5994258 1801600 0
gen_server:loop/6
<4385.6.0> error_logger '-' 1617319 142672 0
gen_event:fetch_msg/
***************etop_txt:init/1 '-' 1104480******** 0
etop:update/1
<4385.932.0> proc_lib:init_p/5 '-' 534423 1114040 2
prim_file:drv_get_re
<4385.1181.0> proc_lib:init_p/5 '-' 527067 1113976 0
prim_file:drv_get_re
<4385.1083.0> proc_lib:init_p/5 '-' 523629 1114040 2
prim_file:drv_get_re
<4385.1113.0> proc_lib:init_p/5 '-' 510615 1113976 0
prim_file:drv_get_re
<4385.1109.0> proc_lib:init_p/5 '-' 505293 1114008 1
prim_file:drv_get_re
<4385.1121.0> proc_lib:init_p/5 '-' 504923 1114008 1
prim_file:drv_get_re
<4385.1142.0> proc_lib:init_p/5 '-' 490480 1113976 0
prim_file:drv_get_re
<4385.1107.0> proc_lib:init_p/5 '-' 484069 1114040 2
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 480740 75888 1
prim_file:drv_get_re
<4385.1101.0> proc_lib:init_p/5 '-' 479437 1113976 0
prim_file:drv_get_re
<4385.2678.0> proc_lib:init_p/5 '-' 467491 88632 0
prim_file:drv_get_re
<4385.1185.0> proc_lib:init_p/5 '-' 460844 1114008 1
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 457993 109304 0
prim_file:drv_get_re
<4385.1106.0> proc_lib:init_p/5 '-' 442674 1113976 0
prim_file:drv_get_re
<4385.1079.0> proc_lib:init_p/5 '-' 433061 1113976 0
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 427045 142784 1
prim_file:drv_get_re
<4385.1093.0> proc_lib:init_p/5 '-' 426558 1114008 1
prim_file:drv_get_re
<4385.1172.0> proc_lib:init_p/5 '-' 425739 1113976 0
prim_file:drv_get_re
<4385.1173.0> proc_lib:init_p/5 '-' 424881 1113976 0
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 424511 109336 1
prim_file:drv_get_re
<4385.1111.0> proc_lib:init_p/5 '-' 423289 1114008 1
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 418404 109304 0
prim_file:drv_get_re
<4385.1115.0> proc_lib:init_p/5 '-' 406488 1113976 0
prim_file:drv_get_re
<4385.936.0> proc_lib:init_p/5 '-' 403025 1113976 0
prim_file:drv_get_re
<4385.1923.0> proc_lib:init_p/5 '-' 397771 109304 0
prim_file:drv_get_re
<4385.6377.0> proc_lib:init_p/5 '-' 393461 122160 0
prim_file:drv_get_re
<4385.1082.0> proc_lib:init_p/5 '-' 384064 1113976 0
prim_file:drv_get_re
<4385.3805.0> proc_lib:init_p/5 '-' 382381 88744 1
prim_file:drv_get_re
<4385.6385.0> proc_lib:init_p/5 '-' 380755 109416 1
prim_file:drv_get_re
***************proc_lib:init_p/5 '-' 376803 109304 0
prim_file:drv_get_re
<4385.1117.0> proc_lib:init_p/5 '-' 371288 1114008 1
prim_file:drv_get_re
<4385.3819.0> proc_lib:init_p/5 '-' 358139 122080 0
prim_file:drv_get_re
<4385.1145.0> proc_lib:init_p/5 '-' 321316 1114008 1
prim_file:drv_get_re
<4385.1628.0> proc_lib:init_p/5 '-' 282420 88712 0
prim_file:drv_get_re
<4385.200.0> rabbit_memory_monito '-' 112557 1682368 0
gen_server2:process_
***************proc_lib:init_p/5 '-' 111776 2992 0
erlang:hibernate/3
<4385.185.2916>proc_lib:init_p/5 '-' 111761 142744 0
gen_server2:process_
<4385.5774.224>proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
<4385.3548.1> proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
<4385.4155.1> proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
<4385.4385.1> proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
<4385.5053.1> proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
<4385.6322.1> proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
***************proc_lib:init_p/5 '-' 110328 2936 0
erlang:hibernate/3
========================================================================================
On Thu, Apr 21, 2011 at 10:38 AM, Raphael Simon <raphael at rightscale.com>wrote:
> Hello Matthew,
>
> We set the number of available file descriptors to around 32,000 and I can
> see that RabbitMQ is using around 14,000 (with the rabbitadmin management
> agent) so there should be a comfortable margin. The number of queues deleted
> was around 2,000 (we have a queue gc daemon that will periodically delete
> queues with no consumers as we can't guarantee that clients will always
> leave cleanly). Wrt to messages expiring: the message throughput is fairly
> uniform and the number of messages expiring at any given time should remain
> fairly constant. I guess bursts are possible though, I would estimate the
> max number of messages expiring all at once to around 40k - do you see an
> issue with this? etop did show the timer process often as the busiest.
>
> Thanks for the help!
>
> --
> Raphael.
>
>
> On Thu, Apr 21, 2011 at 3:42 AM, Matthew Sackman <matthew at rabbitmq.com>wrote:
>
>> Hi Raphael,
>>
>> Thanks for the detailed report. This is an interesting one, and a few
>> ideas spring to mind.
>>
>> You seem to be using the x-message-ttl feature now, with message expiry
>> around 1 day. Do these events happen to correspond with the bulk expiry
>> of a lot of messages? Certainly 60,000 odd msgs in the broker shouldn't
>> cause any problems at all, but just want to check that one. Also, when
>> you say "many queues get deleted", how many are we talking about? The
>> only thing we can think of is if you're deleting more queues than you
>> have file descriptors available for the process then you may run into
>> problems: Rabbit won't crash, but it might thrash fds between different
>> queues which can overall result in vastly more disk activity than
>> necessary.
>>
>> The normal Linux default fd limit (ulimit -n) is 1024, though on OS X,
>> it's down at 256 by default IIRC. If you're deleting more than about
>> 900 queues at any one go, then I'd suggest that you try increasing that
>> substantially: having about double the number of fds available than
>> queues should eliminate any thrashing - if indeed this is the problem.
>> If this is the problem then by strace, you should be able to see an
>> awful lot of open and close calls to various files, whereas if this
>> problem is solved by higher fd ulimit then you should see files stay
>> open for much longer.
>>
>> Please let us know how you get on.
>>
>> Matthew
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20110425/3524677f/attachment-0001.htm>
More information about the rabbitmq-discuss
mailing list