[rabbitmq-discuss] RabbitMQ crashes hard when it runs out of memory

Stephen Day sjaday at gmail.com
Fri Nov 6 20:57:33 GMT 2009


On Fri, Nov 6, 2009 at 2:43 AM, Matthew Sackman <matthew at lshift.net> wrote:

> On Fri, Nov 06, 2009 at 10:06:20AM +0000, Matthew Sackman wrote:
> > That's fine. I have to say that it's unlikely this patch will make it
> > through - the memory management code has gone through a lot of change
> > recently as we're getting a much better handle on resource management.
>
Agreed. This is definitely a workaround rather than a proper fix. In the
interest of full disclosure, I have gotten rabbitmq to crash for the same
reason even with this patch, by making memory spike before the excess could
be collected, so this isn't a full fix by any means. I will try to dig
further into the root cause in the 1.7.0 release when I have time.


> > Whilst you've obviously been working from the head of our default branch
> > (many thanks!), there are a couple of issues with garbage collecting
> > every process like that, for example, it's possible that garbage
> > collecting vast numbers of processes will take longer than the
> > memory_check_interval, making messages queue up for the memory manager
> > process. This would become a problem if the garbage collection is unable
> > to reclaim any memory at all - eg millions of queues, all of which are
> > empty.
>
However, with rabbit's current memory problems, I wouldn't run it with more
than 10 queues, let alone millions ;). I will give the java client test a
try.
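One way to sanity-check the concern about a full sweep taking longer than
memory_check_interval would be to time it from an attached Erlang shell.
Just a sketch, using nothing beyond the standard BIFs:

    %% Time a garbage collection sweep over every process, in milliseconds.
    T0 = now(),
    [erlang:garbage_collect(P) || P <- processes()],
    T1 = now(),
    io:format("full GC sweep took ~p ms~n", [timer:now_diff(T1, T0) div 1000]).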


>
> Some immediate ideas to improve this a little.
> 1) Only do the GC when you initially hit the memory alarm. I.e. in the
> first case when going from non-alarmed to alarmed, put the gc in there,
> then maybe recurse again (though you'll likely want another param on the
> function to stop infinite recursion).
>
This was my initial thought as well. However, by the time the alarm goes
off, it is often too late for this to stop rabbit from crashing. For
example, the default memory alarm threshold is 0.95. Many of the crashes
were due to a failed allocation of 200-300 MB by the VM. On a 4GB machine
(rounding *up*), a 0.95 threshold leaves only about 200 MB of headroom
before the alarm fires (1 - 0.2/4 = 0.95), so any single allocation larger
than that can fail before the alarm even triggers, and the server is dead
in the water.
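For reference, here is roughly how I read suggestion 1). A sketch only; the
function names (maybe_gc_on_alarm, gc_all, above_threshold) are my own
placeholders, not the actual memory monitor code:

    %% Sketch: GC everything only on the non-alarmed -> alarmed transition,
    %% with a bounded number of retries instead of unbounded recursion.
    maybe_gc_on_alarm(false, true) -> gc_all(3);
    maybe_gc_on_alarm(_OldAlarmed, _NewAlarmed) -> ok.

    gc_all(0) -> ok;
    gc_all(Retries) ->
        [erlang:garbage_collect(P) || P <- processes()],
        case above_threshold() of    %% placeholder for the real memory check
            true  -> gc_all(Retries - 1);
            false -> ok
        end.

Even so, as above, this only helps if the alarm fires while there is still
enough headroom left for the next allocation.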


> 2) Only put GC in processes that are known to eat lots of RAM. Eg if
> it's the persister, then putting in a manual GC right after it does a
> snapshot is probably a good idea.
>
The persister is indeed where the binary memory is hanging around, but I am
not sure the snapshot point is the problem. The crashes happen when adding
batches into the queue. It's as if the persister can't keep up.

For now, I will try this at the same point shown in the patch, but only GC
the persister:

garbage_collect(whereis(rabbit_persister)).
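
Since whereis/1 returns undefined when the process isn't registered (e.g.
before the persister has started), I'll probably wrap it along these lines:

    %% Only GC the persister if it is actually registered and running.
    case whereis(rabbit_persister) of
        undefined -> ok;
        Pid       -> erlang:garbage_collect(Pid)
    end.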

I would say the correct pattern would be something like Python's MemoryError,
or handling a malloc failure in C. Is there an exception thrown on a failed
allocation that could be caught, so a garbage collect can be run from there?

_steve