[rabbitmq-discuss] Possible memory leak in the management plugin

Wed Apr 9 11:33:21 BST 2014

On 09/04/14 06:15, Pavel wrote:
> Hello,

Hello. And first of all, thanks for a very detailed email!

> It
> was quickly identified that mgmt_db is a culprit, so we disabled management
> plugin and attempted to reproduce the issue in the lab.

Based on what you've said below, it looks like the fine-grained stats 
are causing the problem, so if you want to re-enable mgmt you might want 
to add

{rabbitmq_management_agent, [{force_fine_statistics, false}]}

in your configuration - this will stop mgmt showing message rates but I 
think it would remove the performance issue you're seeing.
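
For context, a minimal sketch of where that stanza would sit in an 
Erlang-terms rabbitmq.config, assuming the file has no other settings 
(merge it into your existing list of applications if it does):

```erlang
%% Sketch only: assumes this is the sole entry in the config file.
[
  %% Turn off fine-grained (per channel/exchange/queue) statistics.
  {rabbitmq_management_agent, [{force_fine_statistics, false}]}
].
```

The broker needs a restart to pick up the change.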

> Q1: There are three ets tables (as far as I understand) that have million+
> items and keep growing rapidly (5754973, 5759070 and 5763167). What are
> those?

Those numbers are just generated IDs, so they will be different for each 
run of the database. But in your case they are:

5754973 - aggregated_stats
5759070 - aggregated_stats_index
5763167 - old_stats

In (very) short, aggregated_stats contains all the stats for things 
which have history, aggregated_stats_index provides some alternate 
indexes to provide fast lookups into aggregated_stats, and old_stats 
retains the (N-1) version of the raw stats emitted by channels (so that 
we can get the delta to the previous one when aggregating).
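
To illustrate the idea of the delta only (this is a hypothetical sketch, 
not the plugin's actual code; the Delta fun and the counter names are 
made up for the example): given the previous and current raw counters 
from a channel, the aggregation step needs the difference between them.

```erlang
%% Hypothetical illustration, not taken from rabbit_mgmt_db.
%% New and Old are proplists of {Counter, CumulativeValue}.
Delta = fun(New, Old) ->
    %% A counter missing from Old is treated as having been 0.
    [{K, V - proplists:get_value(K, Old, 0)} || {K, V} <- New]
end,
Result = Delta([{publish, 120}, {deliver, 95}],
               [{publish, 100}, {deliver, 90}]).
%% Result is [{publish,20},{deliver,5}]
```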

I would expect all of these to have a size proportional to the number 
of distinct (channel -> exchange), (channel -> exchange -> queue) and 
(queue -> channel) publish and deliver events. So with large fanout 
between channels and exchanges I would expect to see lots of records in 
these tables.

However, I would not expect to see the tables grow indefinitely. They 
should reach a stable size, and then drop back to (comparatively near) 
zero once all the channels are closed (they will still contain 
per-queue, per-exchange, per-vhost records).

So are they growing without bound? What happens when you close all the 
channels?
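
One way to watch this, as a rough sketch: the expressions below list the 
five largest ETS tables on the node by item count, with their names. They 
could be run in a shell attached to the broker node, or passed as a 
single quoted argument to rabbitmqctl eval. Note that ets:info/2 reports 
memory in words, not bytes.

```erlang
%% {Name, Size, MemoryInWords} for every ETS table on this node.
Tabs = [{ets:info(T, name), ets:info(T, size), ets:info(T, memory)}
        || T <- ets:all()],
%% A table may vanish between ets:all() and ets:info(); drop those.
Live = [X || {_, S, _} = X <- Tabs, is_integer(S)],
%% Largest first, top five only.
Top = lists:sublist(lists:reverse(lists:keysort(2, Live)), 5).
```

Running it a few times during the test, and again after closing all the 
channels, should show whether the sizes level off.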

> Q2: Total memory used by ets storage (~120Mb) is far less than total memory
> used by rabbit_mgmt_db process (~3.5Gb). Is there any way to inspect what
> consumes the most of the memory?

That's the weird part. The vast majority of data for the mgmt DB should 
be in those ETS tables; the process memory should only include the 
database's state record (which you printed with sys:get_status/1; it's 
tiny), its stack (also tiny) and any garbage that hasn't yet been collected.

Hmm. That last one is an interesting question. What does

rabbitmqctl eval 'erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db)).'

do to your memory use?

The database will be generating heaps (sorry) of garbage in your test; 
Erlang's GC should be kicking in frequently but I wonder if somehow it 
is not.
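
To put numbers on it, here is a small sketch that reads the process's 
memory before and after a forced collection. Measure is just a name for 
the example, and it assumes the pid is local to the node where the 
expression is evaluated (process_info/2 does not work on remote pids).

```erlang
%% Returns {BytesBefore, BytesAfter} around a forced GC of Pid.
Measure = fun(Pid) ->
    {memory, Before} = erlang:process_info(Pid, memory),
    %% garbage_collect/1 returns true if the process is alive.
    true = erlang:garbage_collect(Pid),
    {memory, After} = erlang:process_info(Pid, memory),
    {Before, After}
end.
```

It would be called as Measure(global:whereis_name(rabbit_mgmt_db)). A 
large drop between the two values would point at uncollected garbage 
rather than live data.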

> It was noticed also, that in the configuration described above (1000
> exchanges x 1 queue each) stats events for publishing channels become very
> big.

Yes, again they contain data for each exchange / queue the channel 
publishes to.

> It also seems that increasing collect_statistics_interval from default
> 5 seconds to 60 seconds helps - memory still goes up after first iteration,
> but then stays within ~2Gb (test was running for at least 1 hour with no
> crash).

That makes sense.
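
For anyone else following along, a sketch of that setting in the same 
config format, assuming no other entries in the file; the interval is 
given in milliseconds under the rabbit application:

```erlang
%% Sketch only: emit stats every 60 seconds instead of the default 5.
[
  {rabbit, [{collect_statistics_interval, 60000}]}
].
```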

> Here is the output of fprof analysis for ~5 seconds of the rabbit_mgmt_db
> process during the test at a slower publish rate:
> https://gist.github.com/maisenovich/10226881
>
> Q3: Is there any better way to profile Erlang/Rabbit? Attempts to use fprof
> under load that is causing the crash didn't work so far.

Yeah, fprof has a huge performance impact since it is of the "trace 
every function call" school. I'm afraid I am not aware of a sampling 
profiler for Erlang.

> We are suspecting that the issue is caused by massive channel_stats events
> which at the high enough rate causing Erlang GC to start falling behind. The
> function invocation below (from linked profile) is particularly raising
> questions:

That seems sort of plausible. But Erlang does not GC like Java: each 
process has its own heap and the GC takes place "within" that process. 
It's not a separate thread or anything, so it shouldn't be able to fall 
behind just because a process is busy. I am still very suspicious of GC 
in your case though; forcing a manual GC as above would be very helpful.

> Q4: Any better theories on what might be causing the unbounded memory
> growth? Any GC tuning you would recommend to attempt?

Not yet.

I'd like to verify that GC is the problem first, then we can look at how 
to do it better.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal