[rabbitmq-discuss] RabbitMQ Stability Issues with large queue - 2011-12-28

DawgTool dawgtool@aol.com
Wed Dec 28 22:20:20 GMT 2011


Hi All,

Ran the publisher against a single node (removed the clustering) and was able to grab the report just before crashing.
Here are the details:

==> dc001.log <==
=INFO REPORT==== 28-Dec-2011::17:00:00 ===
vm_memory_high_watermark set. Memory used:8662988728 allowed:5153960755

=INFO REPORT==== 28-Dec-2011::17:00:00 ===
    alarm_handler: {set,{{vm_memory_high_watermark,'dc001@rmquat-m01'},[]}}

=WARNING REPORT==== 28-Dec-2011::17:00:00 ===
Mnesia('dc001@rmquat-m01'): ** WARNING ** Mnesia is overloaded: {dump_log, time_threshold}




Reporting server status on {{2011,12,28},{22,6,3}}

Status of node 'dc001@rmquat-m01' ...
[{pid,13173},
 {running_applications,
     [{rabbitmq_management_visualiser,"RabbitMQ Visualiser","2.7.0"},
      {rabbitmq_management,"RabbitMQ Management Console","2.7.0"},
      {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.7.0"},
      {webmachine,"webmachine","1.7.0-rmq2.7.0-hg"},
      {amqp_client,"RabbitMQ AMQP Client","2.7.0"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","2.7.0"},
      {rabbit,"RabbitMQ","2.7.0"},
      {mnesia,"MNESIA  CXC 138 12","4.4.17"},
      {os_mon,"CPO  CXC 138 46","2.2.5"},
      {sasl,"SASL  CXC 138 11","2.1.9.3"},
      {mochiweb,"MochiMedia Web Server","1.3-rmq2.7.0-git"},
      {inets,"INETS  CXC 138 49","5.5.2"},
      {stdlib,"ERTS  CXC 138 10","1.17.3"},
      {kernel,"ERTS  CXC 138 10","2.14.3"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B02 (erts-5.8.3) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,8825539568},
      {processes,4061532008},
      {processes_used,4061085200},
      {system,4764007560},
      {atom,1677993},
      {atom_used,1659957},
      {binary,4448013416},
      {code,16452002},
      {ets,295062344}]},
 {vm_memory_high_watermark,0.5999999999767169},
 {vm_memory_limit,5153960755}]

Cluster status of node 'dc001@rmquat-m01' ...
[{nodes,[{disc,['dc001@rmquat-m01']}]},{running_nodes,['dc001@rmquat-m01']}]

Application environment of node 'dc001@rmquat-m01' ...
[{auth_backends,[rabbit_auth_backend_internal]},
 {auth_mechanisms,['PLAIN','AMQPLAIN']},
 {backing_queue_module,rabbit_variable_queue},
 {cluster_nodes,[]},
 {collect_statistics,fine},
 {collect_statistics_interval,5000},
 {default_permissions,[<<".*">>,<<".*">>,<<".*">>]},
 {default_user,<<"guest">>},
 {default_user_tags,[administrator]},
 {default_vhost,<<"/">>},
 {delegate_count,16},
 {error_logger,{file,"/data/rabbitmq/dc001/log/dc001.log"}},
 {frame_max,131072},
 {hipe_compile,true},
 {included_applications,[]},
 {msg_store_file_size_limit,16777216},
 {msg_store_index_module,rabbit_msg_store_ets_index},
 {queue_index_max_journal_entries,262144},
 {sasl_error_logger,{file,"/data/rabbitmq/dc001/log/dc001-sasl.log"}},
 {server_properties,[]},
 {ssl_listeners,[]},
 {ssl_options,[]},
 {tcp_listen_options,[binary,
                      {packet,raw},
                      {reuseaddr,true},
                      {backlog,128},
                      {nodelay,true},
                      {exit_on_close,false}]},
 {tcp_listeners,[5672]},
 {trace_vhosts,[]},
 {vm_memory_high_watermark,0.6}]

Connections:
pid	address	port	peer_address	peer_port	ssl	peer_cert_subject	peer_cert_issuer	peer_cert_validity	auth_mechanism	ssl_protocol	ssl_key_exchange	ssl_cipher	ssl_hash	protocol	user	vhost	timeout	frame_max	client_properties	recv_oct	recv_cnt	send_oct	send_cnt	send_pend	state	channels
<'dc001@rmquat-m01'.1.4461.3>	192.168.0.100	5672	192.168.0.97	33792	false				PLAIN					{0,8,0}	guest	/	0	131072	[]	79447230	86718	292	4	0	blocked	1

Channels:
pid	connection	number	user	vhost	transactional	confirm	consumer_count	messages_unacknowledged	messages_unconfirmed	messages_uncommitted	acks_uncommitted	prefetch_count	client_flow_blocked
<'dc001@rmquat-m01'.1.4465.3>	<'dc001@rmquat-m01'.1.4461.3>	1	guest	/	false	false	0	0	0	0	0	0	false

Queues on /:
pid	name	durable	auto_delete	arguments	owner_pid	slave_pids	synchronised_slave_pids	exclusive_consumer_pid	exclusive_consumer_tag	messages_ready	messages_unacknowledged	messages	consumers	memory	backing_queue_status
<'dc001@rmquat-m01'.1.9112.0>	dc.data:durable:all	true	false	[]						4589172	0	4589172	0	4041816264	[{q1,0}, {q2,0}, {delta,{delta,2667892,1921280,4589172}}, {q3,132544}, {q4,2535348}, {len,4589172}, {pending_acks,0}, {target_ram_count,0}, {ram_msg_count,2535348}, {ram_ack_count,0}, {next_seq_id,4589172}, {persistent_count,0}, {avg_ingress_rate,0.0}, {avg_egress_rate,0.0}, {avg_ack_ingress_rate,0.0}, {avg_ack_egress_rate,0.0}]

Exchanges on /:
name	type	durable	auto_delete	internal	arguments
amq.direct	direct	true	false	false	[]
dc.data:fanout	fanout	true	false	false	[]
amq.topic	topic	true	false	false	[]
amq.rabbitmq.trace	topic	true	false	false	[]
amq.rabbitmq.log	topic	true	false	false	[]
amq.fanout	fanout	true	false	false	[]
amq.headers	headers	true	false	false	[]
	direct	true	false	false	[]
amq.match	headers	true	false	false	[]

Bindings on /:
source_name	source_kind	destination_name	destination_kind	routing_key	arguments
	exchange	dc.data:durable:all	queue	dc.data:durable:all	[]
dc.data:fanout	exchange	dc.data:durable:all	queue		[]

Consumers on /:

Permissions on /:
user	configure	write	read
guest	.*	.*	.*

End of server status report
...done.
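A quick back-of-the-envelope check on the `memory` section of the report above (just a sketch, with the constants copied straight from the rabbitmqctl output) shows how far past the configured limit the node was, and that message payload binaries account for roughly half of the usage:

```python
# Byte counts copied from the rabbitmqctl report above.
total_used = 8_825_539_568   # {memory, total}
binaries = 4_448_013_416     # {memory, binary} - mostly message payloads
limit = 5_153_960_755        # vm_memory_limit (0.6 watermark)

# The node is well past the point where the watermark alarm fired.
print(f"memory used vs limit: {total_used / limit:.2f}x")
print(f"binary share of usage: {binaries / total_used:.0%}")
```

With ~2.5M messages held in RAM (`ram_msg_count` in the queue stats above), the 0.6 watermark alarm fires but memory keeps climbing past it, which matches the crash behaviour described below.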




On Dec 28, 2011, at 9:22 AM, DawgTool wrote:

> RabbitMQ Stability Issues with large queue - 2011-12-28
> 
> Hi All,
> 
> I posted in the IRC channel a few nights ago, and they suggested that I bring this to the listserv.
> Hopefully I can get some suggestions on how to keep my servers from crashing.
> Thanks,
> -- DawgTool
> 
> Cluster Info:
> Cluster status of node 'dc001@rmquat-m01' ...
> [{nodes,[{disc,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
>                'dc001@rmquat-m01']}]},
> {running_nodes,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
>                 'dc001@rmquat-m01']}]
> 
> Config Info:
> ==> enabled_plugins <==
> [rabbitmq_management,rabbitmq_management_agent,rabbitmq_management_visualiser].
> 
> ==> rabbitmq.config <==
> [
>  {rabbit,                    [{vm_memory_high_watermark, 0.6},
>                               {collect_statistics_interval, 5000},
>                               {hipe_compile, true}
>                              ]
>  },
>  {rabbitmq_management,       [ {http_log_dir, "/data/rabbitmq/dc001/rabbit-mgmt"} ] },
>  {rabbitmq_management_agent, [ {force_fine_statistics, true} ] }
> ].
> 
> ==> rabbitmq-env.conf <==
> NODENAME=dc001
> BASE=/data/rabbitmq/dc001
> MNESIA_BASE=/data/rabbitmq/dc001/mnesia
> LOG_BASE=/data/rabbitmq/dc001/log
> SERVER_START_ARGS="-smp enable"
> 
> 
> IRC:
> [17:32] <dawgtool> background. doing some testing on 2.7.0 : 4 servers 2vcpu, 8gb ram, 80gb disc.
> [17:32] <dawgtool> cluster is set up all disc, currently one exchange durable fanout
> [17:33] <dawgtool> one queue also durable bind to the exchange.
> [17:33] <dawgtool> i'm pushing about 5M records, payload is ~500bytes each record
> [17:34] <dawgtool> rate is about 14k/s (which seems pretty slow)
> [17:35] <dawgtool> but my problem is, I'm testing a case where the consumers are busy or unavailable, so the queues would be filling up.
> [17:35] <dawgtool> even after slowing the publish rate to about 4k/s the mirrored queue does not complete on any of the cluster's nodes other than the master.
> [17:37] <dawgtool> memory seems to be the biggest issue here, as the servers will grow past the high water mark, and eventually crash one at a time.
> [17:37] <dawgtool> once they are restarted, most servers in the cluster will have about 200k to 300k of messages in their queue
> [17:40] <dawgtool> so question is, why is so much memory being consumed? (on disk these records are about 5.5GB) RabbitMQ pushes to 7.9GB real, 11.9GB virtual (swapping).
> [17:40] <dawgtool> why is the queue not stopping the publishers? (RAM-based clusters seem to stop the publisher until it can be spilled to disk)
> [17:41] <dawgtool> Why is mirroring unreliable in this test?
> [17:41] <dawgtool> ok, i'm done with the background. =)
> [17:41] <dawgtool> lol
> [17:42] <antares_> having 300K messages in one queue will result in RAM consumption like this
> [17:42] <antares_> 30 queues with 10K is a better option
> [17:43] <antares_> I can't say for mirroring but my understanding is that mirroring cannot decrease RAM usage
> [17:43] <dawgtool> true, i need to make sure I don't lose any records, so disc with mirror
> [17:44] <dawgtool> does performance get that bad after 300k message in a queue?
> [17:44] <antares_> this kind of question is better asked on rabbitmq-discuss (the mailing list)
> [17:45] <antares_> the exact number will vary but yes, having queues with excessive # of messages that are not consumed will result in more or less this behavior
> [17:45] <dawgtool> yea, just joined the mailing list, I can post it there. was hoping someone had a quick answer. =)
> [17:45] <antares_> each queue is backed by one Erlang process and Erlang VM GC releases process memory all at once
> [17:46] <dawgtool> even with the -smp set to true?
> [17:46] <antares_> so having a large number of messages in your queues impedes that
> [17:46] <antares_> dawgtool: I doubt that -smp affects GC behavior
> [17:46] <dawgtool> hmmm
> [17:47] <antares_> in this regard anyway, because there is still no shared heap
> [17:47] <antares_> but rabbitmq team members definitely know more than I do
> [17:47] <antares_> and they are all on rabbitmq-discuss
> [17:48] <dawgtool> ok, I'll give the mailing list a shot. 300k is going to be hard to live under, one of my current systems is doing several times that a second. =(
> [17:49] <dawgtool> which i was hoping to migrate to a more open system, at least parts of it. =(
> [17:49] <antares_> dawgtool: total # of messages is not the point
> [17:50] <antares_> the point is max # of messages in one queue
> [17:50] <antares_> you can have 10K queues and one app consuming messages from them all
> [17:50] <antares_> unless ordering is a serious concern for your case, it will work just as well as 1 queue
> [17:50] <dawgtool> yea, understand, but some consumers might crash, and I will get a backlog
> [17:51] <dawgtool> I need to make sure the MQ system can handle at least a 30 minute outage
> [17:51] <antares_> again, you will have the same problem with just 1 queue
> [17:51] <antares_> dawgtool: I see. And not lose anything?
> [17:51] <dawgtool> right, =(
> [17:51] <antares_> rabbitmq queues can have message TTL
> [17:52] <dawgtool> yea, I have TTL on metric collection consumers.. usually 6 seconds.
> [17:55] <dawgtool> in production, the idea would be to have two exchanges: input.exchange.fanout, metric.exchange.topic bind to input.exchange.fanout. queue.durable.mirror.prod bind to input.exchange.fanout no ttl, several queue.trans.nomirror.metric[1-x] ttl 6sec.
> [17:56] <dawgtool> the input.exchange.fanout will have a second and third queue eventually, which is why there are two exchanges.
> [17:57] <dawgtool> but the second and third will have a 30min ttl
> [18:01] <dawgtool> anyway, thanks for the info.. I'll shoot an email out to the listserv and see if I get any bites. =)
> [18:01] <dawgtool> thanks again. =)
> [18:01] <antares_> dawgtool: no problem
> 
> 


