[rabbitmq-discuss] Help pinpointing an error

Fri Aug 31 10:56:00 BST 2012

Hi

On 31 Aug 2012, at 10:35, Jaime Herazo B. wrote:

> Hi.
> 
> Today a rabbit instance went down. I restarted the service only to be
> greeted by screams as apparently many messages were lost in the process
> (as far as i understood, once queues were marked as "Durable" this
> couldn't happen, but it happened).
> 

This shouldn't happen, so let's try and figure out what went wrong.

> The "Reason for termination" was:
> 
> {{badmatch,[{file_summary,2064936,4810835,2064935,2064937,16780759,true,1}]},
> [
>  {rabbit_msg_store,combine_files,3},
>  {rabbit_msg_store_gc,attempt_action,3},
>  {rabbit_msg_store_gc,handle_cast,2},
>  {gen_server2,handle_msg,2},
>  {proc_lib,wake_up,3}
> ]
> }
> 

The rabbit_msg_store module persists the message contents for durable queues to disk. A badmatch means that an Erlang pattern match failed: the value we got did not have the shape the code expected. Here it occurs because the 'readers' field of the file_summary (the trailing 1 in the tuple) is expected to be 0, not 1. This routine (combine_files) is called when compacting the data (a garbage-collection-like process) and also when the message store is initialising, so my reading thus far is that we've somehow ended up with a process trying to read from the message store before it was properly initialised.
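To make the badmatch a little more concrete, here is a minimal, self-contained sketch that reproduces a term of the same shape. It is not the rabbit_msg_store code: the module name, the helper assert_no_readers/2, the table name, and the record's field names and order are all assumptions inferred from the crash term and the explanation above.

```erlang
-module(badmatch_demo).
-export([run/0]).

%% ASSUMPTION: field names and order are inferred from the crash term and the
%% reply above; they are not copied from the RabbitMQ source.
-record(file_summary,
        {file, valid_total_size, left, right, file_size, locked, readers}).

%% Hypothetical helper: insist that nobody is currently reading the file.
%% ets:lookup/2 returns a list, which is why the crash term wraps the tuple
%% in [].
assert_no_readers(Tab, File) ->
    [#file_summary{readers = 0}] = ets:lookup(Tab, File),
    ok.

run() ->
    Tab = ets:new(file_summary_demo, [set, {keypos, #file_summary.file}]),
    %% Same values as in the reported crash; note readers = 1.
    true = ets:insert(Tab, #file_summary{file = 2064936,
                                         valid_total_size = 4810835,
                                         left = 2064935,
                                         right = 2064937,
                                         file_size = 16780759,
                                         locked = true,
                                         readers = 1}),
    Result = try
                 assert_no_readers(Tab, 2064936)
             catch
                 %% A failed match raises error:{badmatch, Value}, where Value
                 %% is the term that did not match the pattern.
                 error:{badmatch, Value} -> {badmatch, Value}
             end,
    ets:delete(Tab),
    Result.
```

Calling badmatch_demo:run() should return {badmatch, [...]} carrying the same file_summary tuple as in the report. In the real server nothing catches the error, so the gen_server2 process terminates with it as the reason.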

> I'm having trouble even identifying what does this mean, let alone
> preventing it from happening again. It started just fine, so it was
> probably a transient error, but the fact that it took with it all the
> messages in the queue is troubling.
> 

Indeed. We must stop this from happening.

> Can you please point me towards more resources to handle these kinds of
> problems in the future that don't involve loss of data? What did i do
> wrong?
> 
> Also, do you see a hint of what went wrong there, or do i need to give
> more info for this?
> 

Someone better versed in the mechanics of the message store may chime in with a good explanation, but for my part I'd like to understand a few more things:

1. How did your rabbit go down (crashed, accidentally restarted, etc.)?
2. Exactly what steps did you take to restart it?
3. What kind of configuration do you have (is it clustered, any HA queues, etc.)?

And if you could please send over the logs and sasl logs (stripped of any private data if need be), that would be very helpful indeed.