[rabbitmq-discuss] Silent crash causes persistent durable message loss

Will Koffel will at thumb.it
Tue May 22 01:46:43 BST 2012


Thanks for the feedback, guys, and sorry for the slow reply.

I took a look at the machine. It's an Amazon Linux instance on EC2 (loosely based on CentOS). I don't see anything at all related to oom-killer in the logs, nor anything about rabbit. I looked in dmesg and at everything in /var/log.
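For anyone repeating that check, here is a minimal sketch of a scan for OOM-killer traces. It assumes the usual Linux kernel message wording ("invoked oom-killer", "Out of memory", "Killed process <pid>"); the function name find_oom_lines is made up for illustration, and the exact wording can vary between kernel versions.

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer fires.
# Assumption: wording differs slightly across kernel versions.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory|Killed process \d+",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines (without trailing newline) that look like OOM-killer activity.

    `lines` is any iterable of strings, e.g. an open log file or
    the output of dmesg split into lines.
    """
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

It could be fed an open log file, for example find_oom_lines(open("/var/log/messages")), or the captured output of dmesg. An empty result only means no matching lines were found in that input, not that memory pressure is ruled out.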

Amazon doesn't do memory monitoring with AWS CloudWatch (major downside, blech), so I can't see whether there was any memory contention, but we've never seen any issues on those machines. There are no CPU or disk spikes during that time to indicate anything else traumatic happening on the machine in question.

One long-shot idea at the application layer: the daemons we run to consume those queues call "declare" on the relevant exchanges and queues on startup. This has always been a convenient way to ensure the queues exist before we start writing to or reading from them. When RabbitMQ crashes on us, the daemons go into a restart cycle (they die, their parent process restarts them, they die again, etc.).
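As an illustration of that startup pattern, here is a rough sketch of a consumer that retries its declare step with exponential backoff instead of dying and being restarted in a tight loop. It is not our actual daemon code: `declare` is a hypothetical callable standing in for whatever the AMQP client library does to declare the exchanges and queues, and the delay values are arbitrary.

```python
import time


def declare_with_backoff(declare, attempts=8, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call `declare()` until it succeeds, sleeping longer after each failure.

    `declare` is a hypothetical zero-argument callable that connects to the
    broker and declares the exchanges/queues, raising on failure.
    Re-raises the last error once `attempts` tries have failed.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return declare()
        except Exception:
            # In real code, catch only the client library's connection errors.
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

This only reduces how hard clients hammer the broker while it is coming back up; it says nothing about whether a declare arriving during recovery is safe.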

Is there any case in which, WHILE rabbit was starting up, it might accept a connection that instructs it to create the "expiring-queue", and then, when it went to restore the persistent messages from disk, give up because the queue already existed? Is there any sort of race condition that could bite us if clients are thrashing away during startup? Or will RabbitMQ refuse to accept any connections until it has cleanly started up?

-Will


On May 21, 2012, at 5:02 PM, Matthias Radestock wrote:

> On 21/05/12 21:50, Francesco Mazzoli wrote:
>> I have trouble believing that it is actually dying silently with no
>> information in the logs.
> 
> iirc we've seen this in the past with things like 'oom killer'. It is
> probably worth checking the system logs.
> 
>> In the meantime I'm going to do the obvious and suggest to upgrade to
>> 2.8.2. We fixed several ugly bugs related to DLX (one of which was
>> particularly easy to get) and they might be related to your problem.
> 
> I don't think any of the DLX bugs pre 2.8.2 would have brought down an
> entire rabbit, just individual queues.
> 
> Matthias.

________________
Will Koffel
CTO, Thumb™
51 E 12th St., 4th Floor
New York, NY 10003
Office: (212) 673-8650
Mobile: (617) 575-WILL
@thumb
www.thumb.it





