Following up on this, as we are still having the issue. I've pasted the
output of etop for a broker in a 'good state' [1] vs. one in a 'bad
state' [2]. You'll notice that broker1-1 (the one in the 'bad state',
i.e. using ~120% CPU compared to ~20% for the one in the 'good state')
is spending a lot of time in timers and reading from disk. Looking at
the timer_server process on 1-1 shows that it is mostly running
rabbit_amqqueue:sync_timeout, which seems to be the culprit for the high
CPU and disk ops. So what is the purpose of this timer, and why would it
suddenly trigger more often, or at least become more expensive to
handle?
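In case anyone wants to reproduce the check, something along these lines
from a remote shell shows what timer_server is up to (a minimal sketch
using only standard OTP calls; timer_tab is the named ETS table the OTP
timer module keeps its entries in, an implementation detail that may
differ between releases):

    %% From a remote shell attached to the broker, e.g.
    %%   erl -sname debug -remsh rabbit@broker1-1
    P = whereis(timer_server),
    erlang:process_info(P, [message_queue_len, reductions, memory,
                            current_function]).
    %% Number of outstanding timers the OTP timer module is tracking;
    %% timer_tab is the named ETS table it uses.
    ets:info(timer_tab, size).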
Thank you!

--
Raphael.

[1] Broker in 'good state'
========================================================================================
 'rabbit@broker1-2'                                                           21:57:33
 Load:  cpu        12               Memory:  total      855004    binary      39949
        procs   69263                        processes  724592    code        11738
        runq        0                        atom         3149    ets         41428

Pid            Name or Initial Func     Time     Reds   Memory  MsgQ Current Function
----------------------------------------------------------------------------------------
<4385.6.0>     error_logger              '-'   975256   142672     0 gen_event:fetch_msg/
<4385.4844.299>etop_txt:init/1           '-'   959286 ********     0 etop:update/1
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101715     3000     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   101700   142744     0 gen_server2:process_
***************proc_lib:init_p/5         '-'   101700   142744     0 gen_server2:process_
***************proc_lib:init_p/5         '-'   101700   142744     0 gen_server2:process_
<4385.25667.0> proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
<4385.11107.1> proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
<4385.19047.1> proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
<4385.3996.13> proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
<4385.14391.47>proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   100267     2936     0 erlang:hibernate/3
<4385.10908.1> proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
<4385.25705.1> proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
<4385.28252.1> proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
<4385.32119.1> proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
<4385.24583.61>proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
***************proc_lib:init_p/5         '-'   100252   109296     0 gen_server2:process_
***************proc_lib:init_p/5         '-'    92046   142744     0 ets:select/1
<4385.186.0>   rabbit_memory_monito      '-'    41961   779760     0 gen_server2:process_
<4385.71.0>    timer_server              '-'    13240  2914432     0 gen_server:loop/6
<4385.25604.0> proc_lib:init_p/5         '-'    11288  1116400     0 gen_server2:process_
***************erlang:apply/2            '-'    935912342344       0 file_io_server:serve
<4385.191.0>   msg_store_persistent      '-'     8946  2914424     0 gen_server2:process_
<4385.4890.299>proc_lib:init_p/5         '-'     4164    13832     0 gen_server2:process_
<4385.0.0>     init                      '-'     2147    21488     0 init:loop/1
<4385.4800.299>proc_lib:init_p/5         '-'     1892    34616     0 gen_server2:process_
<4385.4886.299>proc_lib:init_p/5         '-'     1774    24784     0 rabbit_reader:mainlo
<4385.3005.28> proc_lib:init_p/5         '-'      637    14120     0 gen_server2:process_
<4385.1688.28> proc_lib:init_p/5         '-'      609    11176     0 gen:do_call/4
<4385.2072.28> proc_lib:init_p/5         '-'      457     3200     0 erlang:hibernate/3
<4385.2163.28> proc_lib:init_p/5         '-'      449    11104     0 gen_server2:process_
<4385.957.28>  proc_lib:init_p/5         '-'      438    26896     0 gen_server2:process_
<4385.29794.27>proc_lib:init_p/5         '-'      423    10848     1 gen_server2:process_
<4385.31982.27>proc_lib:init_p/5         '-'      421    14120     0 gen_server2:process_
***************proc_lib:init_p/5         '-'      414     2984     0 erlang:hibernate/3
<4385.25130.0> proc_lib:init_p/5         '-'      414     2984     0 erlang:hibernate/3
<4385.25554.0> proc_lib:init_p/5         '-'      414     2984     0 erlang:hibernate/3
<4385.25800.0> proc_lib:init_p/5         '-'      414     2936     0 erlang:hibernate/3
<4385.25976.0> proc_lib:init_p/5         '-'      414     2936     0 erlang:hibernate/3
<4385.26169.0> proc_lib:init_p/5         '-'      414     2936     0 erlang:hibernate/3
<4385.26369.0> proc_lib:init_p/5         '-'      414     2936     0 erlang:hibernate/3
<4385.26677.0> proc_lib:init_p/5         '-'      414     3000     0 erlang:hibernate/3
========================================================================================

[2] Broker in 'bad state'
========================================================================================
 'rabbit@broker1-1'                                                           21:57:34
 Load:  cpu       102               Memory:  total     1047278    binary      82002
        procs   73585                        processes  870711    code        11738
        runq        1                        atom         2750    ets         45360

Pid            Name or Initial Func     Time     Reds   Memory  MsgQ Current Function
----------------------------------------------------------------------------------------
<4385.77.0>    timer_server              '-'  5994258  1801600     0 gen_server:loop/6
<4385.6.0>     error_logger              '-'  1617319   142672     0 gen_event:fetch_msg/
***************etop_txt:init/1           '-'  1104480 ********     0 etop:update/1
<4385.932.0>   proc_lib:init_p/5         '-'   534423  1114040     2 prim_file:drv_get_re
<4385.1181.0>  proc_lib:init_p/5         '-'   527067  1113976     0 prim_file:drv_get_re
<4385.1083.0>  proc_lib:init_p/5         '-'   523629  1114040     2 prim_file:drv_get_re
<4385.1113.0>  proc_lib:init_p/5         '-'   510615  1113976     0 prim_file:drv_get_re
<4385.1109.0>  proc_lib:init_p/5         '-'   505293  1114008     1 prim_file:drv_get_re
<4385.1121.0>  proc_lib:init_p/5         '-'   504923  1114008     1 prim_file:drv_get_re
<4385.1142.0>  proc_lib:init_p/5         '-'   490480  1113976     0 prim_file:drv_get_re
<4385.1107.0>  proc_lib:init_p/5         '-'   484069  1114040     2 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   480740    75888     1 prim_file:drv_get_re
<4385.1101.0>  proc_lib:init_p/5         '-'   479437  1113976     0 prim_file:drv_get_re
<4385.2678.0>  proc_lib:init_p/5         '-'   467491    88632     0 prim_file:drv_get_re
<4385.1185.0>  proc_lib:init_p/5         '-'   460844  1114008     1 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   457993   109304     0 prim_file:drv_get_re
<4385.1106.0>  proc_lib:init_p/5         '-'   442674  1113976     0 prim_file:drv_get_re
<4385.1079.0>  proc_lib:init_p/5         '-'   433061  1113976     0 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   427045   142784     1 prim_file:drv_get_re
<4385.1093.0>  proc_lib:init_p/5         '-'   426558  1114008     1 prim_file:drv_get_re
<4385.1172.0>  proc_lib:init_p/5         '-'   425739  1113976     0 prim_file:drv_get_re
<4385.1173.0>  proc_lib:init_p/5         '-'   424881  1113976     0 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   424511   109336     1 prim_file:drv_get_re
<4385.1111.0>  proc_lib:init_p/5         '-'   423289  1114008     1 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   418404   109304     0 prim_file:drv_get_re
<4385.1115.0>  proc_lib:init_p/5         '-'   406488  1113976     0 prim_file:drv_get_re
<4385.936.0>   proc_lib:init_p/5         '-'   403025  1113976     0 prim_file:drv_get_re
<4385.1923.0>  proc_lib:init_p/5         '-'   397771   109304     0 prim_file:drv_get_re
<4385.6377.0>  proc_lib:init_p/5         '-'   393461   122160     0 prim_file:drv_get_re
<4385.1082.0>  proc_lib:init_p/5         '-'   384064  1113976     0 prim_file:drv_get_re
<4385.3805.0>  proc_lib:init_p/5         '-'   382381    88744     1 prim_file:drv_get_re
<4385.6385.0>  proc_lib:init_p/5         '-'   380755   109416     1 prim_file:drv_get_re
***************proc_lib:init_p/5         '-'   376803   109304     0 prim_file:drv_get_re
<4385.1117.0>  proc_lib:init_p/5         '-'   371288  1114008     1 prim_file:drv_get_re
<4385.3819.0>  proc_lib:init_p/5         '-'   358139   122080     0 prim_file:drv_get_re
<4385.1145.0>  proc_lib:init_p/5         '-'   321316  1114008     1 prim_file:drv_get_re
<4385.1628.0>  proc_lib:init_p/5         '-'   282420    88712     0 prim_file:drv_get_re
<4385.200.0>   rabbit_memory_monito      '-'   112557  1682368     0 gen_server2:process_
***************proc_lib:init_p/5         '-'   111776     2992     0 erlang:hibernate/3
<4385.185.2916>proc_lib:init_p/5         '-'   111761   142744     0 gen_server2:process_
<4385.5774.224>proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
<4385.3548.1>  proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
<4385.4155.1>  proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
<4385.4385.1>  proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
<4385.5053.1>  proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
<4385.6322.1>  proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
***************proc_lib:init_p/5         '-'   110328     2936     0 erlang:hibernate/3
========================================================================================

On Thu, Apr 21, 2011 at 10:38 AM, Raphael Simon <raphael@rightscale.com> wrote:

Hello Matthew,

We set the number of available file descriptors to around 32,000 and I
can see that RabbitMQ is using around 14,000 (via the rabbitadmin
management agent), so there should be a comfortable margin. The number
of queues deleted was around 2,000 (we have a queue gc daemon that
periodically deletes queues with no consumers, since we can't guarantee
that clients will always leave cleanly). With regard to messages
expiring: the message throughput is fairly uniform, so the number of
messages expiring at any given time should remain fairly constant. I
guess bursts are possible though; I would estimate the maximum number
of messages expiring all at once at around 40k - do you see an issue
with that? etop did often show the timer process as the busiest.
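In case it helps, here is a rough sketch of the kind of check the gc
daemon performs - hypothetical, not our actual code; host, port and
credentials are placeholders:

    %% Hypothetical sketch only -- not the actual daemon. Assumes the
    %% rabbitmq-management HTTP API on its default (pre-3.0) port 55672
    %% and guest credentials; adjust host/credentials as appropriate.
    application:start(inets),
    Auth = {"Authorization",
            "Basic " ++ base64:encode_to_string("guest:guest")},
    {ok, {{_, 200, _}, _, Body}} =
        httpc:request(get, {"http://broker1-1:55672/api/queues", [Auth]},
                      [], []),
    %% Body is a JSON array of queue objects, each with a "consumers"
    %% count; for every queue where consumers == 0 the daemon then
    %% issues DELETE /api/queues/<VHost>/<Name> via
    %% httpc:request(delete, ...).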
Thanks for the help!

--
Raphael.

On Thu, Apr 21, 2011 at 3:42 AM, Matthew Sackman <matthew@rabbitmq.com> wrote:

Hi Raphael,

Thanks for the detailed report. This is an interesting one, and a few
ideas spring to mind.

You seem to be using the x-message-ttl feature now, with message expiry
around 1 day. Do these events happen to correspond with the bulk expiry
of a lot of messages? Certainly 60,000-odd msgs in the broker shouldn't
cause any problems at all, but I just want to check that one. Also,
when you say "many queues get deleted", how many are we talking about?
The only thing we can think of is that if you're deleting more queues
than you have file descriptors available for the process, you may run
into problems: Rabbit won't crash, but it might thrash fds between
different queues, which can overall result in vastly more disk activity
than necessary.

The normal Linux default fd limit (ulimit -n) is 1024, though on OS X
it's down at 256 by default, IIRC. If you're deleting more than about
900 queues in any one go, then I'd suggest you try increasing that
substantially: having roughly twice as many fds available as you have
queues should eliminate any thrashing - if indeed this is the problem.
If this is the problem then, with strace, you should be able to see an
awful lot of open and close calls on various files, whereas if the
problem is solved by a higher fd ulimit you should see files stay open
for much longer.
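For a quick sanity check from the broker's Erlang shell, something like
the following shows the fd ceiling the emulator believes it has and a
rough count of currently open file-driver ports - a minimal sketch using
only standard calls (max_fds is only reported when the VM's check_io
layer is in use, and "efile" is the file driver name in this OTP
generation):

    %% fd ceiling as seen by the emulator:
    proplists:get_value(max_fds, erlang:system_info(check_io)).
    %% Rough count of currently open file-driver ports ("efile" is the
    %% driver used by prim_file):
    length([P || P <- erlang:ports(),
                 erlang:port_info(P, name) =:= {name, "efile"}]).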

Please let us know how you get on.

Matthew
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss