[rabbitmq-discuss] Crash with RabbitMQ 3.1.5

Tim Watson watson.timothy at gmail.com
Wed Oct 16 18:01:24 BST 2013


On 16 Oct 2013, at 16:34, David Harrison <dave.l.harrison at gmail.com> wrote:

> Quick update on the queue count: 56

Hmm. That seems perfectly reasonable.

> On 17 October 2013 02:29, David Harrison <dave.l.harrison at gmail.com> wrote:
> 
> What version of rabbit are you running, and how was it installed?
> 
> 3.1.5, running on Ubuntu Precise, installed via deb package.

Of course - I missed that in the subject line. 

> I think 3.1.5 is the latest stable ??

Yep.

> 
> I'll take a look, we saw a few "too many processes" messages,
> 

That's not a good sign. I can't say we've run into that very frequently - it is possible to raise the limit (on the number of processes), but I suspect that's not the root of this anyway.

> "Generic server net_kernel terminating" followed by :
> 
> ** Reason for termination ==
> ** {system_limit,[{erlang,spawn_opt,

Yeah - once that goes you're in trouble. That's an unrecoverable error, the equivalent of crashing the JVM.
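For reference, and purely as a sketch (the path and flags here are assumptions based on the stock deb layout - do check them), the limit in question is the emulator's +P flag, which you can override via rabbitmq-env.conf:

    # /etc/rabbitmq/rabbitmq-env.conf  (assumed location for the deb package)
    # SERVER_ERL_ARGS replaces the server's default emulator flags wholesale,
    # so keep kernel poll (+K) and async threads (+A) while raising +P.
    SERVER_ERL_ARGS="+K true +A30 +P 2097152"

That said, if something is spawning processes without bound, raising the ceiling only buys time.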

> There was definitely a network partition, but the whole cluster nose dived during the crash
>  

Yeah, partitions are bad and can even become unrecoverable without restarts (which is why we warn against using clustering in some environments), but what you're experiencing shouldn't happen.
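As an aside, and again only as a sketch (check it against the 3.1.x docs for your setup), the partition behaviour can be switched away from the default "ignore" mode in rabbitmq.config:

    %% /etc/rabbitmq/rabbitmq.config  (assumed path for the deb package)
    [
      {rabbit, [
        %% pause_minority stops nodes on the minority side of a partition;
        %% autoheal instead restarts the "losing" side once the partition clears
        {cluster_partition_handling, pause_minority}
      ]}
    ].

Neither mode is a cure for partitions, of course; they just make the failure mode more predictable.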

> 
> > > The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").
> >
> > That's a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slow/unresponsive-ness.
> 
> These hosts aren't running swap, we give them a fair bit of RAM (gave them even more now as part of a possible stop gap) 
>  

This. I suspect the root of your problem is that you don't have any available swap and somehow ran out of memory. Rabbit should've been paging messages to disk (by hand, not via swap) once you got within a tolerance of the high watermark, which is why I'd like to see the logs if possible: we might be able to identify what led to the runaway process spawning and memory allocation during the partition.

My money, for the memory use part, is on error_logger, which has been known to blow up in this way when flooded with large logging terms. During a partition, various things can go wrong and crash processes such as queues, some of which carry massive state that then gets logged, leading to potential OOM situations like this one. Replacing error_logger has been on our radar before, but we've not had strong enough reasons to warrant the expense. If what you've seen can be linked to that, however...
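In the meantime it's worth double-checking what the nodes are actually running with. Purely as a sketch (assumed file location, and the values shown are what I believe the shipped defaults to be), the relevant knobs in rabbitmq.config are:

    %% /etc/rabbitmq/rabbitmq.config  (assumed path for the deb package;
    %% the values below are, I believe, the shipped defaults)
    [
      {rabbit, [
        %% the memory alarm fires and publishers are blocked at 40% of RAM
        {vm_memory_high_watermark, 0.4},
        %% paging to disk starts at 50% of the watermark, i.e. roughly 20% of RAM
        {vm_memory_high_watermark_paging_ratio, 0.5}
      ]}
    ].

rabbitmqctl status reports the current watermark and a per-node memory breakdown, which is a quick way to confirm those settings took effect.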

To properly diagnose what you've seen though, I will need to get my hands on those logs. Can we arrange that somehow?

Cheers,
Tim

