[rabbitmq-discuss] RabbitMQ Stability Issues with large queue - 2011-12-28

Wed Dec 28 14:22:55 GMT 2011

Hi All,

I posted in the IRC channel a few nights ago, and they suggested that I bring this to the listserv.
Hopefully I can get some suggestions on how to keep my servers from crashing.
Thanks,
-- DawgTool

Cluster Info:
Cluster status of node 'dc001@rmquat-m01' ...
[{nodes,[{disc,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
                'dc001@rmquat-m01']}]},
 {running_nodes,['dc001@rmquat-m04','dc001@rmquat-m03','dc001@rmquat-m02',
                 'dc001@rmquat-m01']}]

Config Info:
==> enabled_plugins <==
[rabbitmq_management,rabbitmq_management_agent,rabbitmq_management_visualiser].

==> rabbitmq.config <==
[
  {rabbit,                    [{vm_memory_high_watermark, 0.6},
                               {collect_statistics_interval, 5000},
                               {hipe_compile, true}
                              ]
  },
  {rabbitmq_management,       [ {http_log_dir, "/data/rabbitmq/dc001/rabbit-mgmt"} ] },
  {rabbitmq_management_agent, [ {force_fine_statistics, true} ] }
].

==> rabbitmq-env.conf <==
NODENAME=dc001
BASE=/data/rabbitmq/dc001
MNESIA_BASE=/data/rabbitmq/dc001/mnesia
LOG_BASE=/data/rabbitmq/dc001/log
SERVER_START_ARGS="-smp enable"
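
For reference, vm_memory_high_watermark is a fraction of the RAM that RabbitMQ detects; once the node's memory use crosses it, a memory alarm is raised and publishing connections are blocked. A quick sketch of what 0.6 works out to, assuming the node sees all of the 8gb mentioned in the IRC log below and that this means 8 GiB (the helper name is made up for illustration):

```python
GIB = 1024 ** 3  # bytes per GiB


def watermark_bytes(total_ram_bytes: int, fraction: float) -> int:
    """Memory level (in bytes) at which the high watermark alarm would trigger."""
    return int(total_ram_bytes * fraction)


# Assumption: 8 GiB visible to the node, fraction 0.6 as in rabbitmq.config above.
threshold = watermark_bytes(8 * GIB, 0.6)
print(f"alarm threshold: {threshold / GIB:.1f} GiB")
```

Under those assumptions the alarm level is roughly 4.8 GiB, so the 7.9 resident figure reported in the log is well past it.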

IRC:
[17:32] <dawgtool> background. doing some testing on 2.7.0 : 4 servers 2vcpu, 8gb ram, 80gb disc.
[17:32] <dawgtool> cluster is set up all disc, currently one exchange, durable fanout
[17:33] <dawgtool> one queue, also durable, bound to the exchange.
[17:33] <dawgtool> i'm pushing about 5M records, payload is ~500bytes each record
[17:34] <dawgtool> rate is about 14k/s (which seems pretty slow)
[17:35] <dawgtool> but my problem is, I'm testing a case where the consumers are busy or unavailable, so the queues would be filling up.
[17:35] <dawgtool> even after slowing the publish rate to about 4k/s the mirrored queue does not complete on any of the cluster's nodes other than the master.
[17:37] <dawgtool> memory seems to be the biggest issue here, as the servers will grow past the high water mark, and eventually crash one at a time.
[17:37] <dawgtool> once they are restarted, most servers in the cluster will have about 200k to 300k of messages in their queue
[17:40] <dawgtool> so question is, why is so much memory being consumed (on disk these records are about 5.5GB)? RabbitMQ pushes to 7.9GB real, 11.9GB virtual (swapping).
[17:40] <dawgtool> why is the queue not stopping the publishers (RAM based clusters seem to stop the publisher until it can be spilled to disk)
[17:41] <dawgtool> Why is mirroring unreliable in this test.
[17:41] <dawgtool> ok, i'm done with the background. =)
[17:41] <dawgtool> lol
[17:42] <antares_> having 300K messages in one queue will result in RAM consumption like this
[17:42] <antares_> 30 queues with 10K is a better option
[17:43] <antares_> I can't say for mirroring but my understanding is that mirroring cannot decrease RAM usage
[17:43] <dawgtool> true, i need to make sure I don't lose any records, so disc with mirror
[17:44] <dawgtool> does performance get that bad after 300k messages in a queue?
[17:44] <antares_> this kind of question is better asked on rabbitmq-discuss (the mailing list)
[17:45] <antares_> the exact number will vary but yes, having queues with excessive # of messages that are not consumed will result in more or less this behavior
[17:45] <dawgtool> yea, just joined the mailing list, I can post it there. was hoping someone had a quick answer. =)
[17:45] <antares_> each queue is backed by one Erlang process and Erlang VM GC releases process memory all at once
[17:46] <dawgtool> even with the -smp set to true?
[17:46] <antares_> so having a large amount of messages in your queues impedes that
[17:46] <antares_> dawgtool: I doubt that -smp affects GC behavior
[17:46] <dawgtool> hmmm
[17:47] <antares_> in this regard anyway, because there is still no shared heap
[17:47] <antares_> but rabbitmq team members definitely know more than I do
[17:47] <antares_> and they are all on rabbitmq-discuss
[17:48] <dawgtool> ok, I'll give the mailing list a shot.  300k is going to be hard to live under, one of my current systems is doing several times that a second. =(
[17:49] <dawgtool> which i was hoping to migrate to a more open system, at least parts of it. =(
[17:49] <antares_> dawgtool: total # of messages is not the point
[17:50] <antares_> the point is max # of messages in one queue
[17:50] <antares_> you can have 10K queues and one app consuming messages from them all
[17:50] <antares_> unless ordering is a serious concern for your case, it will work just as well as 1 queue
[17:50] <dawgtool> yea, understand, but some consumers might crash, and I will get a backlog
[17:51] <dawgtool> I need to make sure the MQ system can handle at least a 30 minute outage
[17:51] <antares_> again, you will have the same problem with just 1 queue
[17:51] <antares_> dawgtool: I see. And not lose anything?
[17:51] <dawgtool> right, =(
[17:51] <antares_> rabbitmq queues can have message TTL
[17:52] <dawgtool> yea, I have TTL on metric collection consumers.. usually 6 seconds.
[17:55] <dawgtool> in production, the idea would be to have two exchanges: input.exchange.fanout, and metric.exchange.topic bound to input.exchange.fanout. queue.durable.mirror.prod bound to input.exchange.fanout with no ttl, plus several queue.trans.nomirror.metric[1-x] with ttl 6sec.
[17:56] <dawgtool> the input.exchange.fanout will have a second and third queue eventually, which is why there are two exchanges.
[17:57] <dawgtool> but the second and third will have a 30min ttl
[18:01] <dawgtool> anyway, thanks for the info.. I'll shoot an email out to the listserv and see if I get any bites. =)
[18:01] <dawgtool> thanks again. =)
[18:01] <antares_> dawgtool: no problem
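
To put numbers on the 30 minute outage requirement and the "many smaller queues" suggestion from the log, here is a rough sketch. It uses the publish rates (4k/s and 14k/s) and the ~500 byte payload quoted above. The 10K-per-queue target is antares_' rule of thumb, not a RabbitMQ limit, and the function names and the queue.shard. prefix are made up for illustration. Note also that a fanout exchange copies every message to every bound queue, so spreading messages out like this would need a direct or topic exchange with the shard name as routing key.

```python
import zlib

MSG_BYTES = 500            # approximate payload size quoted in the log
PER_QUEUE_TARGET = 10_000  # rule of thumb from the log, not a broker limit


def backlog(rate_per_s: int, outage_s: int, msg_bytes: int = MSG_BYTES):
    """Return (message_count, payload_bytes) that pile up while consumers are down."""
    count = rate_per_s * outage_s
    return count, count * msg_bytes


def queues_needed(message_count: int, per_queue: int = PER_QUEUE_TARGET) -> int:
    """Ceiling division: queues required to keep each one at or under per_queue."""
    return -(-message_count // per_queue)


def shard_queue(key: bytes, n_queues: int, prefix: str = "queue.shard.") -> str:
    """Pick a queue name deterministically from a message key (hypothetical naming)."""
    return f"{prefix}{zlib.crc32(key) % n_queues}"


if __name__ == "__main__":
    for rate in (4_000, 14_000):
        count, size = backlog(rate, 30 * 60)
        print(f"{rate}/s for 30 min: {count} msgs, "
              f"{size / 1e9:.1f} GB payload, {queues_needed(count)} queues")
    print(shard_queue(b"record-1", 30))
```

At 4k/s a 30 minute outage is 7.2M messages (about 3.6 GB of payload, 720 queues at 10K each); at 14k/s it is 25.2M messages (about 12.6 GB, 2520 queues). Payload alone at the higher rate already exceeds the 8gb of RAM per node, which is why the backlog has to end up on disk rather than in memory whichever way the queues are split.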