[rabbitmq-discuss] rabbit disk_mode branch eating up all RAM, including swap, dying

Mon Oct 5 11:04:21 BST 2009

Hi Brian,

On Sun, Oct 04, 2009 at 09:03:28AM -0400, Brian Whitman wrote:
> Hi, we're using 184cb96f7846+ (bug20980) and our host alerted us that rabbit
> was eating up all available swap on a 16GB real + 8GB swap machine.

20980 stopped getting development work some time ago. As per my recent
email to the list, development work is currently focussed on moving away
from using mnesia. I wouldn't really recommend using 20980 any more:
the branches that grew out of it have since been through QA, which caught
some bugs that are still present in 20980.

> 
> """
> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
> 18445 rabbitmq  18   0 24.7g  14g 1696 S 1087.1 91.7   2268:18  10g beam.smp
> 
> In an effort to prevent kernel panic, we restarted the rabbitmq service,
> freeing up a considerable amount of swap:
> 
> However, the rabbitmq server is not starting again as expected, due to the
> following exception:
> 
> 2009-10-04 06:26:29.797201500 {"init terminating in
> do_boot",{{nocatch,{error,{cannot_start_application,rabbit,{{timeout_waiting_for_tables,[rabbit_disk_queue]},{rabbit,start,[normal,[]]}}}}},[{init,start_it,1},{init,start_em,1}]}}
> """

Hmm, interesting. Is this in a clustered setup, or unclustered?

> They had to delete the mnesia folder (losing all our disk-backed queues) and
> restart, now it's fine. I would guess that this breakage coincided with us
> storing quite a large number of unacked messages in the queues (job
> instructions for a very large batch)

How many messages did you have in there, and do you know the average
size?
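
To show why the count and average size matter, here is a rough
back-of-envelope estimate sketched in Python. It is only an illustration:
the function names and the overhead_bytes parameter are hypothetical, and
the real per-message bookkeeping that rabbit and mnesia add is not known
here, so it defaults to zero. The only figures taken from this thread are
the 16GB of RAM and 8GB of swap.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimated_footprint_bytes(message_count, avg_body_bytes, overhead_bytes=0):
    """Lower-bound estimate of the bytes needed to hold every message.

    overhead_bytes is a hypothetical per-message bookkeeping cost; the
    true value depends on the broker and is an assumption here.
    """
    return message_count * (avg_body_bytes + overhead_bytes)


def fits_in_memory(message_count, avg_body_bytes,
                   ram_gib=16, swap_gib=8, overhead_bytes=0):
    """Return True if the estimate fits within RAM plus swap.

    The defaults mirror the 16GB real + 8GB swap machine in the report.
    """
    available = (ram_gib + swap_gib) * GIB
    needed = estimated_footprint_bytes(message_count, avg_body_bytes,
                                       overhead_bytes)
    return needed <= available


if __name__ == "__main__":
    # Hypothetical workloads, not figures from the report.
    for count in (10_000_000, 100_000_000):
        needed = estimated_footprint_bytes(count, 1024)
        print(count, "messages:", round(needed / GIB, 2), "GiB,",
              "fits" if fits_in_memory(count, 1024) else "does not fit")
```

With message bodies alone, a backlog in the hundreds of millions at around
1KiB each would already exceed RAM plus swap on that machine, before any
broker overhead is counted.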

> a) Would upgrading this branch fix this? We were avoiding doing so because
> things were relatively stable.

The work in 20980 went onto other branches but has not gone onto default
yet because of issues we uncovered in the persister design. Thus the
default branch has the same persister as in v1.6. You would probably be
better off using branch bug21444, which has had the benefit of a lot of
QA attention and bug fixes. That said, all the usual warnings about
using unreleased code do apply.

> b) is there anything else I can look at to debug? The logs don't have
> anything of importance.

Not really - the clustering code was wrong in 20980 for a long time, so
if you were in a clustered setup, I'd blame that - and the error that
you've got would support that too. However, if you just had billions of
messages in there, then even in a non-clustered setup, I could believe
mnesia would be taking a long time to start up and that might cause the
above error.

Best wishes,

Matthew