[rabbitmq-discuss] Erlang crashes trying to allocate 583848200 bytes of memory

Mon Jan 17 20:18:55 GMT 2011

Hi, Mark...

Thanks very much for your mail!

> Hi, I'm experiencing reproducible crashes of Erlang when running
> RabbitMQ 2.20 on a clean install of 64 bit Windows server 2008 R2 box
> - a parallels virtual machine with 1024 gig.
>
> They seem to be very similar to the problems here :
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2010-December/010443.html

We've seen what appears to be this problem elsewhere with users
running on Windows.  Unfortunately, I'm not sure that it's related to
the queue memory leak discussed in the thread above, and the problem
I'm thinking of, which looks like yours, occurs even in 2.2.0.  :-(

> In particular I get
> "broker running
> Crash dump was written to: erl_crash.dump
> eheap_alloc: Cannot allocate 583848200 bytes of memory (of type "old_heap").
> This application has requested the Runtime to terminate it in an unusual way.
> Please contact the application's support team for more information."
> "Cannot allocate 583848200 bytes of memory (of type "old_heap")." is
> the same message talked about in the previous discussion.

What's happened here is that the Erlang runtime has run desperately
short of memory, been unable to allocate what it needs for its own
use, and shut down.  We're not sure quite why this happens to Windows
users at this point.  More below...

> In my scenario, I load a durable queue with between 110k and 130k
> messages -around 900 bytes each- with the consumer off. I then turn on
> the consumer -to simulate the consumer being unable to contact the
> broker, or consumer being down for a period etc. - the Akka consumer
> then starts processing the messages with erl.exe consuming about 50
> meg.

This should be an entirely reasonable scenario.  If the broker is
accumulating messages that aren't being consumed, it should page them
to disk once its memory use rises above 0.4 times the calculated
memory high watermark.  That watermark is in turn computed by
multiplying your high watermark config setting by the amount of RAM
available to the Erlang process as detected at startup.  On present
Windows versions of Erlang this is limited to 2GB, regardless of
whether your OS is 64-bit or not.
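
To make that arithmetic concrete, here is a minimal Python sketch.  The
function name and the 2 GiB detected-RAM figure are purely
illustrative, and the 0.4 shown as the stock watermark setting is my
understanding of the default rather than something read from your
broker; the 0.4 paging fraction is the one described above.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_thresholds(detected_ram_bytes, high_watermark, paging_fraction=0.4):
    """Return (watermark_bytes, paging_bytes).

    watermark = detected RAM * high watermark config setting
    paging    = watermark * paging_fraction (0.4, as described above)
    """
    watermark_bytes = detected_ram_bytes * high_watermark
    paging_bytes = watermark_bytes * paging_fraction
    return watermark_bytes, paging_bytes


# Assumption: Erlang detects 2 GiB at startup (the Windows cap mentioned above).
detected_ram = 2 * GIB

# Compare the assumed default setting (0.4) with a lowered one (0.2).
for setting in (0.4, 0.2):
    watermark, paging = memory_thresholds(detected_ram, setting)
    print("setting %.1f: high watermark %.0f MiB, paging from %.0f MiB"
          % (setting, watermark / MIB, paging / MIB))
```

Halving the config setting halves both thresholds, so paging and
producer throttling would both start at half the memory use.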

The paging should relieve memory pain on the broker; also, producers
should be slowed by back-pressure, as the broker will stop reading
from the sockets to which they're publishing when it's pushed into
paging territory.  Ideally, all of these things should resolve and
stabilize once things catch up, modulo catastrophic resource
exhaustion such as a destination disk used for paging filling up.

> With an empty durable queue, If I start the consumer, then start my
> load test, the queue runs for hours, Akka consumes and erl.exe stays
> around 50meg.

So with a nicely balanced, active consumer, everything works fine?

> It was suggested switching to a Linux environment, this may be
> possible later, but currently Our planned environment is Windows
> Server 2008 R2.

That's good information for us.  The other customer we've seen with
this problem was running Vista and I only had 2008 R2 available to me
to try and reproduce it.  Using their test code I was never able to
reproduce the problem, but I was running under virtualization on
Amazon EC2.  Interestingly, you're also running under virtualization,
although on a Mac and with a different hypervisor.  It's not yet clear
what the implications of that are.

> There was also talk of an "experimental fix for this". Any idea if
> this work, or when this will be part of the next release ?

Alas, at this moment we still aren't sure what the problem is, other
than that it appears to be specific to Windows and/or the Windows
release of Erlang.  It's plausible that the throughput and/or
scheduling of I/O under the Windows version of Erlang is different
enough that by the time the paging mechanism engages, it can't catch
up with the producers, and Rabbit and the Erlang runtime end up in a
bad corner of their state space.

Here's a shot in the partially illuminated dark you might try... 

Check out the discussion here of memory-based flow control:

http://www.rabbitmq.com/extensions.html#memsup

Since you have a test case in front of you that crashes reproducibly,
would you mind trying to set your vm_memory_high_watermark to
something lower than the default, say 0.2, following the instructions
presented at the above page?  That would cause the throttling of
producers to kick in earlier.  It would be interesting to see whether
this makes your system more stable under your test load.  If you do
give this a try, please report back your results; they're likely to
be helpful in refining our understanding of the issue.
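
For reference, here is a minimal sketch of what that setting might look
like in rabbitmq.config, assuming the Erlang-term config format
described on the page above.  Where the file lives depends on your
install, and 0.2 is only the experimental value suggested here, not a
recommendation.

```erlang
%% rabbitmq.config (sketch only)
%% Lower the memory high watermark to 0.2 for this experiment.
[
  {rabbit, [
    {vm_memory_high_watermark, 0.2}
  ]}
].
```

The broker reads this file at startup, so it will most likely need a
restart before the new value takes effect.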

> Prior to this load testing, RabbitMQ worked faultlessly.

It may be small comfort given that you can't readily switch to running
under Linux just now, but to my knowledge we've never seen this
problem happen with the broker running on Linux.  Erlang seems much
more widely used and deployed atop Linux and other Unixes, and those
releases seem to get more testing, scrutiny and love, and thus present
fewer surprises.

My apologies that we don't have a great answer for you just now...  If
you do try the suggested experiment of lowering the
vm_memory_high_watermark value, I'd be very interested in seeing how
you make out...

Best regards,
Jerry