[rabbitmq-discuss] Silent crash causes persistent durable message loss

Wed May 23 09:36:34 BST 2012

On 22 May 2012, at 08:26, Tim Watson wrote:

> On 22/05/2012 01:46, Will Koffel wrote:
>> Thanks for the feedback guys, sorry for the slow reply back.
>> 
>> I took a look on the machine. It's running on an Amazon Linux instance
>> on EC2 (based on CentOS loosely). I don't see anything at all related to
>> oom-killer in logs. Nor anything about rabbit. I looked in dmesg, and
>> everything in /var/log.
>> 
>> Amazon doesn't do memory monitoring with AWS CloudWatch (major downside,
>> blech), so I can't see if there was any memory contention, but we've
>> never seen any issues on those machines. There are no CPU or disk spikes
>> during that time to indicate anything else traumatic happening on the
>> machine in question.
>> 
>> One long-shot idea at the application layer: The daemons we run to
>> consume those queues call "declare" on the relevant exchanges and queues
>> on startup. This has always been a convenient way to ensure the queues
>> are alive before we start writing/reading them. In the case where
>> RabbitMQ crashes on us, the daemons start the cycle (they die, their
>> parent process restarts them, they die again, etc.).
>> 
>> Is there any case in which WHILE rabbit was starting up, it might accept
>> a connection, which would instruct it to create the "expiring-queue",
>> and then when it went to restore the persistent messages from disk, it
>> would give up since the queue already existed? Any sort of race
>> condition that could bite us if there are clients thrashing away during
>> start? Or will RabbitMQ fail to accept any connections until it's
>> cleanly started up?
>> 
> 
> Hi Will. Rabbit's networking sub-system only comes online after everything else is ready, plus the boot sequence is entirely sequential, so I'd be very surprised if this was a race on startup. Having said that, I'll verify that none of the boot steps run out of band and confirm this morning.
> 

Hi Will. Rabbit does appear to start up synchronously in every step of its boot sequence, so a race on startup looks unlikely.

Quick question for you, just to confirm: when you put messages on the 'expiring queue', are all of the following conditions true? All of them are needed to ensure that the messages will be on disk:

1. the queue is durable (you've said this already)
2. the messages are marked as persistent (you've also said this already)
3. the publisher has received an ACK (publisher confirm) for the message *or* you're publishing inside a transaction that has committed

If all 3 of these are in place, then the messages should certainly be on disk after (3).
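
To make the checklist concrete, here is a minimal sketch in plain Python. It does not talk to a broker; `PublishState`, `unmet_conditions` and `expected_on_disk` are names made up for illustration, not part of any client library. The only protocol detail it assumes is that delivery mode 2 marks a message as persistent in AMQP 0-9-1.

```python
from dataclasses import dataclass

# AMQP 0-9-1 delivery-mode value that marks a message as persistent.
PERSISTENT = 2


@dataclass
class PublishState:
    """Hypothetical record of how one message was published."""
    queue_durable: bool
    delivery_mode: int
    confirm_acked: bool = False  # broker sent a publisher confirm (ACK)
    tx_committed: bool = False   # published in a transaction that committed


def unmet_conditions(state):
    """Return the numbers of the conditions listed above that do not hold."""
    unmet = []
    if not state.queue_durable:
        unmet.append(1)
    if state.delivery_mode != PERSISTENT:
        unmet.append(2)
    if not (state.confirm_acked or state.tx_committed):
        unmet.append(3)
    return unmet


def expected_on_disk(state):
    """True only when all three conditions hold."""
    return not unmet_conditions(state)
```

If `unmet_conditions` returns anything other than an empty list for the way your publisher works, that would be the first place I'd look.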

>> -Will
>> 
>> 
>> On May 21, 2012, at 5:02 PM, Matthias Radestock wrote:
>> 
>>> On 21/05/12 21:50, Francesco Mazzoli wrote:
>>>> I have trouble believing that it is actually dying silently with no
>>>> information in the logs.
>>> 
>>> iirc we've seen this in the past with things like 'oom killer'. It is
>>> probably worth checking the system logs.
>>> 
>>>> In the meantime I'm going to do the obvious and suggest to upgrade to
>>>> 2.8.2. We fixed several ugly bugs related to DLX (one of which was
>>>> particularly easy to get) and they might be related to your problem.
>>> 
>>> I don't think any of the DLX bugs pre 2.8.2 would have brought down an
>>> entire rabbit, just individual queues.
>>> 
>>> Matthias.
>> 
>> ________________
>> Will Koffel
>> CTO, Thumb™
>> 51 E 12th St., 4th Floor
>> New York, NY 10003
>> Office: (212) 673-8650
>> Mobile: (617) 575-WILL
>> @thumb
>> www.thumb.it <http://www.thumb.it/>
>> 
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss