[rabbitmq-discuss] Erlang has closed

Thu Sep 17 16:43:53 BST 2009

> I've seen this symptom before, but the circumstances were a bit
> contrived - the way the output from uptime was being processed on OSX
> - see http://erlang.org/pipermail/erlang-questions/2008-May/034889.html.
>
> However, I'm not convinced that this has anything to do with what you
> are seeing.
>
> Can you provide us with some more details of the environment? Is it
> possible to isolate the code that can reproduce this?

It's Rabbit 1.6.0 on Linux, 2.6.27 kernel.  The machine has a degraded
(rebuilding) RAID5 array, so its disk IO is stuck at a max of around
6MB/s.  At the time of the errors, the system load was between 8 and
15, for an 8-core machine with 4GB RAM.  There's a chance that it was
also swapping, but I wasn't watching that.  Its swap isn't empty, so
it certainly has swapped at some point.
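
One way to tell whether the machine is actively swapping during a run,
rather than having swapped at some point in the past, is to sample the
cumulative pswpin/pswpout counters in /proc/vmstat and look at how they
change.  A minimal Python sketch, assuming a Linux kernel that exposes
those counters; the function names are made up for illustration and are
not part of any RabbitMQ or Erlang tooling:

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) cumulative page counts from /proc/vmstat text.

    Counters that are missing from the text are reported as 0.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_delta(before, after):
    """Pages swapped in and out between two (pswpin, pswpout) samples."""
    return after[0] - before[0], after[1] - before[1]


def read_swap_counters(path="/proc/vmstat"):
    """Read the current counters from the kernel (Linux only)."""
    with open(path) as f:
        return parse_swap_counters(f.read())


def watch_swap(interval_seconds=5, samples=12):
    """Print swap-in/swap-out page deltas for a fixed number of intervals."""
    previous = read_swap_counters()
    for _ in range(samples):
        time.sleep(interval_seconds)
        current = read_swap_counters()
        pages_in, pages_out = swap_delta(previous, current)
        print("swapped in: %d pages, swapped out: %d pages" % (pages_in, pages_out))
        previous = current
```

Calling watch_swap() while the load is high prints one line per
interval; non-zero deltas mean the machine is swapping right now.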

The system was processing a new chunk of data, which took around 8
hours to complete.  From the MQ point of view, there were probably a
few million messages enqueued (well, at least dispatched to consumers;
I realize that rabbit tries really hard not to actually "enqueue"
messages), but no more than 100,000 messages at any one time.  I can
try simulating the message flows on a system with a really high load
to see if I can make the problem occur reliably, but we have rabbit on
a few dozen machines, and it doesn't look like it happens often.

I can give more details if you can think of anything more specific
that would be useful.  I can't share our source code, but I'll try to
come up with a repeatable test case.