[rabbitmq-discuss] Fwd: Rabbitmq falling over & losing messages

0x6e6562 at gmail.com
Mon Dec 1 19:25:44 GMT 2008


---------- Forwarded message ----------
From: Toby White <toby.o.h.white at googlemail.com>
Date: Mon, 1 Dec 2008 18:11:38 +0000
Subject: Re: [rabbitmq-discuss] Rabbitmq falling over & losing messages
To: Ben Hood <0x6e6562 at gmail.com>

Thanks for the explanation; that makes sense.

Am I right in thinking that the rate of message arrival is unimportant
- i.e. that if I fed the messages in more slowly (still without
consumers) I'd still see the same behaviour: Erlang aggressively
trying to grab too much memory, failing, and crashing?

I should clarify: I don't intend to run without any consumers. The
intended workflow is that large numbers of messages are periodically
dumped to a queue while workers consume constantly at a much lower
instantaneous rate (though obviously the long-term ingress and egress
rates will be equal). I'd hoped to be able to dump more messages at
once, but for the moment I can get round this by sending fewer,
larger messages.
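
For what it's worth, a minimal sketch of that workaround - batching
many small payloads into one larger persistent message - might look
like this in Python. The pika client, the 'work' queue name, the
batch size, and produce_items() are all illustrative assumptions on
my part, not anything from Rabbit itself:

    import json
    import pika  # assumes the pika AMQP client is installed

    BATCH_SIZE = 1000  # illustrative; tune for your payload sizes

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='work', durable=True)  # made-up queue name

    def flush(batch):
        # Publish one larger persistent message in place of many small ones.
        channel.basic_publish(
            exchange='',
            routing_key='work',
            body=json.dumps(batch),
            properties=pika.BasicProperties(delivery_mode=2))  # 2 = persistent

    batch = []
    for item in produce_items():  # hypothetical source of small payloads
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)
    connection.close()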

I think you're probably right that it makes most sense to solve this
as part of the more general work on paging queues to disk - it is
certainly not blocking me now.

Toby

On 1 Dec 2008, at 17:00, Ben Hood wrote:

> Toby,
>
> On Mon, Dec 1, 2008 at 11:33 AM, Toby White
> <toby.o.h.white at googlemail.com> wrote:
>>> Do you not see anything in the log about an alarm handler for
>>> memory, e.g.
>>>
>>> =INFO REPORT==== 9-Nov-2008::15:13:31 ===
>>>  alarm_handler: {set,{system_memory_high_watermark,[]}}
>
> I've tried to simulate your test locally and I see the same results,
> albeit with different figures. The issue is that even with persistent
> messaging you are still bound by memory, because Rabbit keeps each
> message in memory and copies it to the journal as well. You can see
> this in the fact that Erlang tries to allocate ~700MB of memory in one
> hit:
>
>> Crash dump was written to: erl_crash.dump
>> eheap_alloc: Cannot allocate 729810240 bytes of memory (of type
>> "heap").
>> Aborted
>
> Because the allocation is so big, the memory supervisor never gets a
> chance to set the high watermark and let producer flow control
> throttle the rate of production: the Erlang VM has already died.
>
> This is caused by the fact that nothing is consuming messages, so the
> persister log grows continuously. Because of the way the snapshot
> mechanism currently works, an increasing amount of memory is required
> to create snapshots of a growing journal.
>
> A solution to this may be to refactor the persister so that it can
> handle large surges in persistent message volumes without having to
> rely on something draining it.
>
> However, we are looking into the whole area of disk-based queue
> paging, and it may be more appropriate to incorporate a solution for
> the symptom you are seeing into that work.
>
> Another reason not to just *optimize* the persister is holistic:
> even if you can write messages to disk in a smooth fashion, you are
> still going to run into the issue that all messages are also kept in
> memory.
>
> HTH,
>
> Ben
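
P.S. For completeness, and to go with Ben's point above that the
persister only stays bounded while something is draining the queue,
the consuming side of the workflow might look like the following
sketch. As before, pika, the 'work' queue, and process() are
illustrative assumptions rather than anything prescribed:

    import json
    import pika  # assumes the pika AMQP client is installed

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='work', durable=True)  # same made-up queue

    def handle(ch, method, properties, body):
        for item in json.loads(body):  # unpack one batched message
            process(item)              # process() is a hypothetical worker
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack so Rabbit can
                                                        # discard the message

    channel.basic_qos(prefetch_count=1)  # take one batch at a time
    channel.basic_consume(queue='work', on_message_callback=handle)
    channel.start_consuming()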



