[rabbitmq-discuss] Silent crash causes persistent durable message loss

Tue May 22 08:26:54 BST 2012

On 22/05/2012 01:46, Will Koffel wrote:
> Thanks for the feedback guys, sorry for the slow reply back.
>
> I took a look on the machine. It's running on an Amazon Linux instance
> on EC2 (loosely based on CentOS). I don't see anything at all related to
> oom-killer in the logs, nor anything about rabbit. I looked in dmesg and
> everything in /var/log.
>
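
For checking the logs mechanically, here is a minimal sketch in Python. The
patterns are an assumption based on typical Linux kernel OOM messages (the
exact wording varies between kernel versions), and find_oom_lines is a
hypothetical helper, not part of any RabbitMQ tooling.

```python
import re

# Assumed patterns: typical kernel OOM wording; adjust for your kernel version.
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like kernel OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Possible usage (paths differ per distribution):
#   with open("/var/log/messages", errors="replace") as fh:
#       for hit in find_oom_lines(fh):
#           print(hit)
```

An empty result only means that nothing matching these patterns was logged;
it does not rule out memory pressure on its own.
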
> Amazon doesn't do memory monitoring with AWS CloudWatch (major downside,
> blech), so I can't see if there was any memory contention, but we've
> never seen any issues on those machines. There are no CPU or disk spikes
> during that time to indicate anything else traumatic happening on the
> machine in question.
>
> One long-shot idea at the application layer: The daemons we run to
> consume those queues call "declare" on the relevant exchanges and queues
> on startup. This has always been a convenient way to ensure the queues
> are alive before we start writing/reading them. In the case where
> RabbitMQ crashes on us, the daemons start the cycle (they die, their
> parent process restarts them, they die again, etc.).
>
> Is there any case in which WHILE rabbit was starting up, it might accept
> a connection, which would instruct it to create the "expiring-queue",
> and then when it went to restore the persistent messages from disk, it
> would give up since the queue already existed? Any sort of race
> condition that could bite us if there are clients thrashing away during
> start? Or will RabbitMQ fail to accept any connections until it's
> cleanly started up?
>

Hi Will. Rabbit's networking subsystem only comes online after
everything else is ready, and the boot sequence is entirely sequential,
so I'd be very surprised if this were a race on startup. Having said
that, I'll verify this morning that none of the boot steps run out of
band, and confirm.
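
On the client side, this means a daemon that starts while the broker is still
booting should simply fail to connect until the listeners are up. A minimal
sketch of retrying with exponential backoff instead of dying and being
respawned, in Python: wait_for_broker and its parameters are hypothetical
names, and the connect callable stands in for whatever your client library
uses to open a connection.

```python
import time


def wait_for_broker(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as exc:  # includes ConnectionRefusedError and timeouts
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(
        f"broker not reachable after {attempts} attempts"
    ) from last_error


# Possible usage, assuming the broker listens on the default AMQP port 5672:
#   import socket
#   conn = wait_for_broker(
#       lambda: socket.create_connection(("localhost", 5672), timeout=2))
```

Declaring the exchanges and queues only after this call returns keeps the
existing "declare on startup" pattern, without the die/restart cycle while
the broker is down.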

> -Will
>
>
> On May 21, 2012, at 5:02 PM, Matthias Radestock wrote:
>
>> On 21/05/12 21:50, Francesco Mazzoli wrote:
>>> I have trouble believing that it is actually dying silently with no
>>> information in the logs.
>>
>> iirc we've seen this in the past with things like 'oom killer'. It is
>> probably worth checking the system logs.
>>
>>> In the meantime I'm going to do the obvious and suggest upgrading to
>>> 2.8.2. We fixed several ugly bugs related to DLX (one of which was
>>> particularly easy to trigger) and they might be related to your problem.
>>
>> I don't think any of the DLX bugs pre 2.8.2 would have brought down an
>> entire rabbit, just individual queues.
>>
>> Matthias.
>
> ________________
> Will Koffel
> CTO, Thumb™
> 51 E 12th St., 4th Floor
> New York, NY 10003
> Office: (212) 673-8650
> Mobile: (617) 575-WILL
> @thumb
> www.thumb.it <http://www.thumb.it/>