[rabbitmq-discuss] Possible memory leak in the management plugin
pmaisenovich at blizzard.com
Thu Apr 10 20:19:52 BST 2014
> That table [aggregated_stats] contains one row for every
> combination of things that can show message rates, and each row contains
> some history for that thing.
> The GCing is about deleting old history from each row. This is a
> relatively expensive operation, so the DB loops round GCing 1% of rows
> (or 100 rows, whichever is larger) every 5s. That means that we can keep
> a bit more history around than we're configured to, just because we
> haven't got round to GCing it yet.
It sounds like:
1) If at least one message is published to each channel/exchange/queue within
the retention period (10s by default for detailed stats), the entire history
is kept in memory (in aggregated_stats) and relies solely on the clean-up
process ("GC-ing") to remove old records.
2) Because the GC-ing process runs at a constant rate (100 rows or 1% of the
table every 5 seconds, whichever is larger), there is always a possibility
that aggregated_stats growth will outpace the clean-up. In other words, if
(event rate x number of data points) is high enough, mgmt_db will grow
continuously until Rabbit hits the high watermark and starts throttling
publishers/consumers.
This is very concerning: Rabbit might be stable at the moment, yet blow up
at some point in the future depending on concurrency.
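The arithmetic behind point 2 can be sketched as follows. This is a
simplified model, not RabbitMQ code; the constant names mirror the defaults
described above (100 rows or 1% of the table, every 5 seconds), and it
assumes the best case where no row is visited twice in a pass:

```python
GC_MIN_ROWS = 100      # rows trimmed per GC run
GC_MIN_RATIO = 0.01    # or 1% of the table, whichever is larger
GC_INTERVAL = 5        # seconds between GC runs

def full_pass_seconds(total_rows: int) -> float:
    """Seconds for one complete GC pass over the table, assuming
    (optimistically) that no row is selected twice."""
    rows_per_run = max(GC_MIN_ROWS, GC_MIN_RATIO * total_rows)
    runs_needed = total_rows / rows_per_run
    return runs_needed * GC_INTERVAL

print(full_pass_seconds(2_000))      # 100.0 s - small tables finish quickly
print(full_pass_seconds(10_000))     # 500.0 s - the 1% branch kicks in here
print(full_pass_seconds(1_000_000))  # 500.0 s - constant once above 10,000 rows
```

Note that above 10,000 rows the best-case pass time is a constant 500
seconds regardless of table size, but only if rows are never revisited;
random selection gives no such guarantee.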
As far as I can tell, rows for GC are currently selected at random. Is there
a way to cycle through them so that the entire aggregated_stats table is
guaranteed a GC pass every 500 seconds (when the table has 10,000+ rows)?
This should raise the concurrency required to outpace the GC.
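The cycling I have in mind could look roughly like this (a hypothetical
sketch, not RabbitMQ's implementation): keep a cursor into a stable ordering
of the row keys and advance it each run, wrapping around, so every row is
visited once per ceil(N / batch) runs.

```python
def make_cyclic_selector():
    """Return a batch selector that cycles through keys instead of
    sampling them randomly, guaranteeing full coverage."""
    cursor = 0

    def next_batch(keys, batch_size):
        nonlocal cursor
        if not keys:
            return []
        cursor %= len(keys)
        # Take batch_size keys starting at the cursor, wrapping around.
        batch = [keys[(cursor + i) % len(keys)]
                 for i in range(min(batch_size, len(keys)))]
        cursor = (cursor + batch_size) % len(keys)
        return batch

    return next_batch
```

With 10,000 keys and a batch of 100, every key is visited within 100 runs,
i.e. 500 seconds at a 5-second interval, which is exactly the bound asked
about above.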
Also, would you consider making GC_MIN_ROWS and GC_MIN_RATIO externally
configurable? That would give some room for tuning, but ultimately the best
solution would be to make the GC-ing process dynamic and memory-aware.
Speaking of which, what do you think about monitoring the aggregated_stats
table size and dynamically increasing the row count per GC run, up to a
certain threshold? Since the batch size can't grow forever (presumably due
to the performance implications you mentioned), it would also be great to
have some sort of high-watermark-based throttling for fine-grained stats,
similar to how transient queues begin to page out to disk at a certain
level. Then, if the GC-ing process falls behind, aggregated_stats would not
consume all RAM and force Rabbit to throttle publishers; instead it would
downgrade the monitoring level (and raise an alert!) as a preventive
measure.
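To make the suggestion concrete, here is a minimal sketch of that
memory-aware policy. All names and thresholds are hypothetical (they are not
RabbitMQ configuration): scale the per-run batch with table size up to a
cap, and past a hard row threshold signal that the stats detail level should
be downgraded rather than letting the table eat all RAM.

```python
BASE_ROWS = 100                # today's fixed minimum batch
BASE_RATIO = 0.01              # today's 1%-of-table rule
MAX_BATCH = 5_000              # cap so a single GC run stays cheap
DOWNGRADE_THRESHOLD = 500_000  # beyond this, shed load instead of GC-ing harder

def plan_gc(total_rows: int):
    """Return (batch_size, downgrade_stats) for the next GC run."""
    batch = min(MAX_BATCH, max(BASE_ROWS, int(BASE_RATIO * total_rows)))
    downgrade = total_rows > DOWNGRADE_THRESHOLD
    return batch, downgrade
```

For example, plan_gc(1_000) keeps today's behaviour (batch of 100, no
action), while plan_gc(1_000_000) caps the batch at 5,000 and flags that the
monitoring level should be downgraded and an alert raised.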