[rabbitmq-discuss] RabbitMQ Stability Issues with large queue - 2011-12-28
DawgTool
dawgtool@aol.com
Wed Dec 28 22:20:20 GMT 2011
Hi All,
I ran the publisher against a single node (clustering removed) and was able to grab the report just before the crash.
Here are the details:
==> dc001.log <==
=INFO REPORT==== 28-Dec-2011::17:00:00 ===
vm_memory_high_watermark set. Memory used:8662988728 allowed:5153960755
=INFO REPORT==== 28-Dec-2011::17:00:00 ===
alarm_handler: {set,{{vm_memory_high_watermark,'dc001@rmquat-m01'},[]}}
=WARNING REPORT==== 28-Dec-2011::17:00:00 ===
Mnesia('dc001@rmquat-m01'): ** WARNING ** Mnesia is overloaded: {dump_log, time_threshold}
Reporting server status on {{2011,12,28},{22,6,3}}
Status of node 'dc001@rmquat-m01' ...
[{pid,13173},
{running_applications,
[{rabbitmq_management_visualiser,"RabbitMQ Visualiser","2.7.0"},
{rabbitmq_management,"RabbitMQ Management Console","2.7.0"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.7.0"},
{webmachine,"webmachine","1.7.0-rmq2.7.0-hg"},
{amqp_client,"RabbitMQ AMQP Client","2.7.0"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","2.7.0"},
{rabbit,"RabbitMQ","2.7.0"},
{mnesia,"MNESIA CXC 138 12","4.4.17"},
{os_mon,"CPO CXC 138 46","2.2.5"},
{sasl,"SASL CXC 138 11","2.1.9.3"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.7.0-git"},
{inets,"INETS CXC 138 49","5.5.2"},
{stdlib,"ERTS CXC 138 10","1.17.3"},
{kernel,"ERTS CXC 138 10","2.14.3"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B02 (erts-5.8.3) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,
[{total,8825539568},
{processes,4061532008},
{processes_used,4061085200},
{system,4764007560},
{atom,1677993},
{atom_used,1659957},
{binary,4448013416},
{code,16452002},
{ets,295062344}]},
{vm_memory_high_watermark,0.5999999999767169},
{vm_memory_limit,5153960755}]
Cluster status of node 'dc001@rmquat-m01' ...
[{nodes,[{disc,['dc001@rmquat-m01']}]},{running_nodes,['dc001@rmquat-m01']}]
Application environment of node 'dc001@rmquat-m01' ...
[{auth_backends,[rabbit_auth_backend_internal]},
{auth_mechanisms,['PLAIN','AMQPLAIN']},
{backing_queue_module,rabbit_variable_queue},
{cluster_nodes,[]},
{collect_statistics,fine},
{collect_statistics_interval,5000},
{default_permissions,[<<".*">>,<<".*">>,<<".*">>]},
{default_user,<<"guest">>},
{default_user_tags,[administrator]},
{default_vhost,<<"/">>},
{delegate_count,16},
{error_logger,{file,"/data/rabbitmq/dc001/log/dc001.log"}},
{frame_max,131072},
{hipe_compile,true},
{included_applications,[]},
{msg_store_file_size_limit,16777216},
{msg_store_index_module,rabbit_msg_store_ets_index},
{queue_index_max_journal_entries,262144},
{sasl_error_logger,{file,"/data/rabbitmq/dc001/log/dc001-sasl.log"}},
{server_properties,[]},
{ssl_listeners,[]},
{ssl_options,[]},
{tcp_listen_options,[binary,
{packet,raw},
{reuseaddr,true},
{backlog,128},
{nodelay,true},
{exit_on_close,false}]},
{tcp_listeners,[5672]},
{trace_vhosts,[]},
{vm_memory_high_watermark,0.6}]
Connections:
pid address port peer_address peer_port ssl peer_cert_subject peer_cert_issuer peer_cert_validity auth_mechanism ssl_protocol ssl_key_exchange ssl_cipher ssl_hash protocol user vhost timeout frame_max client_properties recv_oct recv_cnt send_oct send_cnt send_pend state channels
<'dc001@rmquat-m01'.1.4461.3> 192.168.0.100 5672 192.168.0.97 33792 false PLAIN {0,8,0} guest / 0 131072 [] 79447230 86718 292 4 0 blocked 1
Channels:
pid connection number user vhost transactional confirm consumer_count messages_unacknowledged messages_unconfirmed messages_uncommitted acks_uncommitted prefetch_count client_flow_blocked
<'dc001@rmquat-m01'.1.4465.3> <'dc001@rmquat-m01'.1.4461.3> 1 guest / false false 0 0 0 0 0 0 false
Queues on /:
pid name durable auto_delete arguments owner_pid slave_pids synchronised_slave_pids exclusive_consumer_pid exclusive_consumer_tag messages_ready messages_unacknowledged messages consumers memory backing_queue_status
<'dc001@rmquat-m01'.1.9112.0> dc.data:durable:all true false [] 4589172 0 4589172 0 4041816264 [{q1,0}, {q2,0}, {delta,{delta,2667892,1921280,4589172}}, {q3,132544}, {q4,2535348}, {len,4589172}, {pending_acks,0}, {target_ram_count,0}, {ram_msg_count,2535348}, {ram_ack_count,0}, {next_seq_id,4589172}, {persistent_count,0}, {avg_ingress_rate,0.0}, {avg_egress_rate,0.0}, {avg_ack_ingress_rate,0.0}, {avg_ack_egress_rate,0.0}]
Exchanges on /:
name type durable auto_delete internal arguments
amq.direct direct true false false []
dc.data:fanout fanout true false false []
amq.topic topic true false false []
amq.rabbitmq.trace topic true false false []
amq.rabbitmq.log topic true false false []
amq.fanout fanout true false false []
amq.headers headers true false false []
direct true false false []
amq.match headers true false false []
Bindings on /:
source_name source_kind destination_name destination_kind routing_key arguments
exchange dc.data:durable:all queue dc.data:durable:all []
dc.data:fanout exchange dc.data:durable:all queue []
Consumers on /:
Permissions on /:
user configure write read
guest .* .* .*
End of server status report
...done.
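
For anyone trying to reproduce this: the test publisher boils down to the sketch below. This is a Python/pika approximation, not the actual code I ran; the exchange and queue names are the ones in the report above, and the message count and payload size are from the IRC session quoted below. One detail worth noting: the report shows {persistent_count,0}, so the test messages were apparently published transient even though the queue and exchange are durable.

    import pika

    # Sketch of the test publisher (pika is an assumption; host guessed
    # from the node name 'dc001@rmquat-m01').
    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rmquat-m01"))
    ch = conn.channel()

    # Durable fanout exchange and durable queue, as in the report above.
    ch.exchange_declare(exchange="dc.data:fanout", exchange_type="fanout",
                        durable=True)
    ch.queue_declare(queue="dc.data:durable:all", durable=True)
    ch.queue_bind(queue="dc.data:durable:all", exchange="dc.data:fanout")

    payload = b"x" * 500  # ~500-byte record, per the IRC log
    for _ in range(5000000):  # ~5M records
        # No delivery_mode=2 here: persistent_count is 0 in the report,
        # so the test messages look transient.
        ch.basic_publish(exchange="dc.data:fanout", routing_key="",
                         body=payload)
    conn.close()
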
On Dec 28, 2011, at 9:22 AM, DawgTool wrote:
> RabbitMQ Stability Issues with large queue - 2011-12-28
>
> Hi All,
>
> I posted in the IRC channel a few nights ago, and they suggested that I bring this to the listserv.
> Hopefully I can get some suggestions on how to keep my servers from crashing.
> Thanks,
> -- DawgTool
>
> Cluster Info:
> Cluster status of node 'dc001@rmquat-m01' ...
> [{nodes,[{disc,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
> 'dc001@rmquat-m01']}]},
> {running_nodes,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
> 'dc001@rmquat-m01']}]
>
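
One note on the mirroring discussed in the IRC log below: as I understand it, on 2.7.0 a queue is only mirrored if it was declared with the x-ha-policy argument; there is no global switch. A minimal declaration sketch, assuming a pika client, with the queue name from the report:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rmquat-m01"))
    ch = conn.channel()
    # Ask the broker to mirror this queue across all nodes in the cluster.
    ch.queue_declare(queue="dc.data:durable:all", durable=True,
                     arguments={"x-ha-policy": "all"})
    conn.close()
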
> Config Info:
> ==> enabled_plugins <==
> [rabbitmq_management,rabbitmq_management_agent,rabbitmq_management_visualiser].
>
> ==> rabbitmq.config <==
> [
> {rabbit, [{vm_memory_high_watermark, 0.6},
> {collect_statistics_interval, 5000},
> {hipe_compile, true}
> ]
> },
> {rabbitmq_management, [ {http_log_dir, "/data/rabbitmq/dc001/rabbit-mgmt"} ] },
> {rabbitmq_management_agent, [ {force_fine_statistics, true} ] }
> ].
>
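
For what it's worth, the 0.6 watermark matches the limit in the report at the top of this mail: on an 8GB box,

    >>> int(0.6 * 8 * 1024**3)  # vm_memory_high_watermark * total RAM
    5153960755

which is exactly the vm_memory_limit / allowed value in the alarm, while actual usage (8662988728 bytes) was about 1.7x the limit by the time the node fell over, presumably because memory grew faster than the broker could page the queue out to disk.
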
> ==> rabbitmq-env.conf <==
> NODENAME=dc001
> BASE=/data/rabbitmq/dc001
> MNESIA_BASE=/data/rabbitmq/dc001/mnesia
> LOG_BASE=/data/rabbitmq/dc001/log
> SERVER_START_ARGS="-smp enable"
>
>
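On the flow-control question in the IRC log below: the report above actually shows the one connection in state "blocked", so the memory alarm did stop the publisher, just not before memory had overshot the watermark. One way to keep a publisher from running ahead of the broker is publisher confirms; a sketch, assuming pika, where confirm_delivery() makes each basic_publish wait for the broker's ack:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rmquat-m01"))
    ch = conn.channel()
    ch.confirm_delivery()  # broker must confirm every publish

    payload = b"x" * 500
    for _ in range(5000000):
        # With confirms on, this call blocks until the broker acks, so a
        # broker that is busy paging to disk naturally slows the publisher.
        ch.basic_publish(exchange="dc.data:fanout", routing_key="",
                         body=payload)
    conn.close()

Whether confirms would have prevented the overshoot here is a guess, but they at least cap the publish rate at whatever the broker will acknowledge.
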
> IRC:
> [17:32] <dawgtool> background: doing some testing on 2.7.0: 4 servers, 2 vCPU, 8GB RAM, 80GB disk.
> [17:32] <dawgtool> cluster is set up all disc, currently one exchange, durable fanout
> [17:33] <dawgtool> one queue, also durable, bound to the exchange.
> [17:33] <dawgtool> i'm pushing about 5M records, payload is ~500 bytes per record
> [17:34] <dawgtool> rate is about 14k/s (which seems pretty slow)
> [17:35] <dawgtool> but my problem is, I'm testing a case where the consumers are busy or unavailable, so the queues would be filling up.
> [17:35] <dawgtool> even after slowing the publish rate to about 4k/s, the mirrored queue does not complete on any of the cluster's nodes other than the master.
> [17:37] <dawgtool> memory seems to be the biggest issue here, as the servers will grow past the high watermark, and eventually crash one at a time.
> [17:37] <dawgtool> once they are restarted, most servers in the cluster will have about 200k to 300k messages in their queue
> [17:40] <dawgtool> so question is, why is so much memory being consumed? (on disk these records are about 5.5GB) RabbitMQ pushes to 7.9GB real, 11.9GB virtual (swapping).
> [17:40] <dawgtool> why is the queue not stopping the publishers? (RAM-based clusters seem to stop the publisher until it can be spilled to disk)
> [17:41] <dawgtool> Why is mirroring unreliable in this test?
> [17:41] <dawgtool> ok, i'm done with the background. =)
> [17:41] <dawgtool> lol
> [17:42] <antares_> having 300K messages in one queue will result in RAM consumption like this
> [17:42] <antares_> 30 queues with 10K is a better option
> [17:43] <antares_> I can't say for mirroring but my understanding is that mirroring cannot decrease RAM usage
> [17:43] <dawgtool> true, i need to make sure I don't lose any records, so disc with mirror
> [17:44] <dawgtool> does performance get that bad after 300k message in a queue?
> [17:44] <antares_> questions like this are better asked on rabbitmq-discuss (the mailing list)
> [17:45] <antares_> the exact number will vary but yes, having queues with excessive # of messages that are not consumed will result in more or less this behavior
> [17:45] <dawgtool> yea, just joined the mailing list, I can post it there. was hoping someone had a quick answer. =)
> [17:45] <antares_> each queue is backed by one Erlang process and Erlang VM GC releases process memory all at once
> [17:46] <dawgtool> even with the -smp set to true?
> [17:46] <antares_> so having a large amount of messages in your queues impedes that
> [17:46] <antares_> dawgtool: I doubt that -smp affects GC behavior
> [17:46] <dawgtool> hmmm
> [17:47] <antares_> in this regard anyway, because there is still no shared heap
> [17:47] <antares_> but rabbitmq team members definitely know more than I do
> [17:47] <antares_> and they are all on rabbitmq-discuss
> [17:48] <dawgtool> ok, I'll give the mailing list a shot. 300k is going to be hard to live under; one of my current systems is doing several times that a second. =(
> [17:49] <dawgtool> which I was hoping to migrate to a more open system, at least parts of it. =(
> [17:49] <antares_> dawgtool: total # of messages is not the point
> [17:50] <antares_> the point is max # of messages in one queue
> [17:50] <antares_> you can have 10K queues and one app consuming messages from them all
> [17:50] <antares_> unless ordering is a serious concern for your case, it will work just as well as 1 queue
> [17:50] <dawgtool> yea, understand, but some consumers might crash, and I will get a backlog
> [17:51] <dawgtool> I need to make sure the MQ system can handle at least a 30 minute outage
> [17:51] <antares_> again, you will have the same problem with just 1 queue
> [17:51] <antares_> dawgtool: I see. And not lose anything?
> [17:51] <dawgtool> right, =(
> [17:51] <antares_> rabbitmq queues can have message TTL
> [17:52] <dawgtool> yea, I have TTL on metric collection consumers.. usually 6 seconds.
> [17:55] <dawgtool> in production, the idea would be to have two exchanges: input.exchange.fanout, and metric.exchange.topic bound to input.exchange.fanout; queue.durable.mirror.prod bound to input.exchange.fanout with no TTL, and several queue.trans.nomirror.metric[1-x] with a 6sec TTL.
> [17:56] <dawgtool> the input.exchange.fanout will have a second and third queue eventually, which is why there are two exchanges.
> [17:57] <dawgtool> but the second and third will have a 30min ttl
> [18:01] <dawgtool> anyway, thanks for the info.. I'll shoot an email out to the listserv and see if I get any bites. =)
> [18:01] <dawgtool> thanks again. =)
> [18:01] <antares_> dawgtool: no problem
>
>
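
Following up on antares_'s suggestion above (many shallow queues instead of one deep one) and the TTL discussion: a sketch, assuming pika, of what that topology could look like. The shard names are made up for illustration; only the queue.trans.nomirror.metric naming and the 6-second TTL come from the discussion. Note the shards hang off the default exchange rather than the fanout, since a fanout would copy every message to every queue instead of spreading the load.

    import pika

    N = 30  # antares_'s example: 30 queues of ~10K beats one 300K-deep queue

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="rmquat-m01"))
    ch = conn.channel()

    # One durable queue per shard; a consumer can subscribe to all of them.
    for i in range(N):
        ch.queue_declare(queue="dc.data:durable:%d" % i, durable=True)

    # A transient metric queue whose messages expire after 6 seconds.
    ch.queue_declare(queue="queue.trans.nomirror.metric1",
                     arguments={"x-message-ttl": 6000})  # TTL in ms

    payload = b"x" * 500
    for i in range(5000000):
        # Round-robin across the shards via the default exchange
        # (routing key = queue name). Ordering only holds per shard.
        ch.basic_publish(exchange="",
                         routing_key="dc.data:durable:%d" % (i % N),
                         body=payload)
    conn.close()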