[rabbitmq-discuss] Rabbitmq falling over & losing messages
Toby White
toby.o.h.white at googlemail.com
Mon Dec 1 11:33:05 GMT 2008
On 29 Nov 2008, at 00:06, Ben Hood wrote:
> So are you saying that with the latest version of Rabbit you are still
> losing messages that are marked as persistent (as you indicated in
> your first post)?
Yes.
> Ok, this looks normal for a case when Rabbit runs out of memory
> because you have flooded it with messages.
>
> Currently the only preventative action against this is the
> channel.flow command - see this article for the background :
> http://hopper.squarespace.com/blog/2008/11/9/flow-control-in-rabbitmq.html
>
> ATM producer throttling requires a well behaved client, i.e. one that
> obeys the channel.flow command - the Python client currently isn't
> well behaved in this respect.
Thanks - I'd seen that blog post, but was hoping I wouldn't be running
into flooding issues quite yet!
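For what it's worth, my understanding of what a "well behaved" producer
would look like is roughly the sketch below. The on_flow hook is purely
hypothetical (the current Python client doesn't expose anything like it),
but it shows the pattern of pausing publishes while the broker has flow
switched off:

import threading

# Hypothetical sketch of a "well behaved" producer.  The broker sends
# channel.flow(active=False) when it wants publishers to pause and
# channel.flow(active=True) when they may resume.  The on_flow hook
# below is an assumption -- the current Python client does not offer
# such a callback -- but the pausing pattern is the point.

publish_allowed = threading.Event()
publish_allowed.set()        # flow starts out active

def on_flow(active):
    # Would be called by the client library whenever a channel.flow
    # method arrives from the broker (hypothetical hook).
    if active:
        publish_allowed.set()
    else:
        publish_allowed.clear()

def publish_all(channel, messages):
    for msg in messages:
        publish_allowed.wait()   # block rather than publish while throttled
        channel.basic_publish(msg, exchange="test-exchange",
                              routing_key="test-key")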
>> even though, as best I can tell from watching the output of top,
>> the erlang
>> process never took more than about 10% of available memory.
>
> Do you not see anything in the log about an alarm handler for
> memory, e.g.
>
> =INFO REPORT==== 9-Nov-2008::15:13:31 ===
> alarm_handler: {set,{system_memory_high_watermark,[]}}
No alarm handlers of any sort, nor anything obviously to do with memory.
root@domU-12-31-39-02-61-F6:/tmp# grep -i alarm rabbit.log
root@domU-12-31-39-02-61-F6:/tmp# grep -i memory rabbit.log
root@domU-12-31-39-02-61-F6:/tmp#
From the start of message sending to the crash, the server log looks
like:
=INFO REPORT==== 1-Dec-2008::11:25:46 ===
accepted TCP connection on 0.0.0.0:5672 from 127.0.0.1:45049
=INFO REPORT==== 1-Dec-2008::11:25:46 ===
starting TCP connection <0.216.0> from 127.0.0.1:45049
=INFO REPORT==== 1-Dec-2008::11:25:53 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:25:59 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:06 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:13 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:21 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:30 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:38 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:26:58 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=INFO REPORT==== 1-Dec-2008::11:27:18 ===
Rolling persister log to "/tmp/rabbitmq-rabbit-mnesia/
rabbit_persister.LOG.previous"
=ERROR REPORT==== 1-Dec-2008::11:27:23 ===
connection <0.216.0> (running), channel 1 - error:
{amqp,internal_error,
"commit failed: [{exit,{timeout,{gen_server,call,[<0.212.0>,
{commit,{{1,<0.221.0>},93093}},5000]}}}]",
'tx.commit'}
=INFO REPORT==== 1-Dec-2008::11:27:23 ===
closing TCP connection <0.216.0> from 127.0.0.1:45049
[followed by a dump of the whole queue]
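For reference, the producer is doing essentially a transactional publish
of persistent messages; a rough sketch of that pattern (using py-amqplib,
with placeholder exchange/queue names rather than my actual code) is:

from amqplib import client_0_8 as amqp

# Rough sketch of a transactional, persistent publish (placeholder
# exchange/queue names).  The tx_commit() at the end corresponds to
# the 'tx.commit' that times out in the error report above.
conn = amqp.Connection(host='localhost:5672', userid='guest', password='guest')
ch = conn.channel()
ch.queue_declare(queue='test-queue', durable=True, auto_delete=False)
ch.exchange_declare('test-exchange', 'direct', durable=True, auto_delete=False)
ch.queue_bind(queue='test-queue', exchange='test-exchange', routing_key='test-key')

ch.tx_select()                    # open a transaction on the channel
for i in range(100000):
    msg = amqp.Message('payload %d' % i, delivery_mode=2)  # 2 = persistent
    ch.basic_publish(msg, exchange='test-exchange', routing_key='test-key')
ch.tx_commit()                    # the call that times out when the persister falls behind

ch.close()
conn.close()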
> I find the memory statistic a bit strange - the alarm handler kicks in
> by default at 95%.
>
> Simon is currently looking into an issue with the way Erlang reports on
> memory consumption on Linux - maybe he can shed some light on what
> may be going on with your installation.
>
> Also, can you give some more details about your environment? Are you
> running Xen?
Yes; this is on an Amazon EC2 instance. Currently I'm using just a
small instance (1.7 GB of memory, 160 GB of instance storage, 32-bit
platform) - eventually I'll be running on a larger instance, but I'm
still working my way up to that; I was trying to calibrate my resource
usage when I ran into this issue. It's running mostly Ubuntu Hardy,
but now with Erlang R12B-3 from Intrepid. Nothing else is running on
the instance except essential services (sshd, cron, etc.).
The crash occurs consistently at about 10% memory usage. Memory usage
actually increases shortly after the crash, up to about 30-40% or so;
I'm guessing this is Erlang formatting the queue object for output to
the log.
Toby