[rabbitmq-discuss] empty rabbit_serial file causes rabbitmq cluster to hang

Denisenko, Mikhail (NIH/NLM/NCBI) [C] mikhail.denisenko at nih.gov
Wed May 29 15:43:06 BST 2013


Thanks for filing the feature request. In our case the file became corrupted after a restart of RabbitMQ. Our restart procedure attempts a graceful restart first and, if the process has not exited after some time, falls back to kill -9. I suspect that in this case the node did not exit within the timeout, was killed with -9, and perhaps never flushed its buffer for this file.
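
The restart procedure described above could be sketched roughly as follows. This is only an illustrative Python sketch, not the actual restart script: the function name `stop_with_timeout`, the grace period, and the stand-in child process are all hypothetical. It shows the general pattern (polite termination request, bounded wait, then SIGKILL) and why the last step is risky: a process killed with -9 gets no chance to flush or clean up.

```python
import subprocess
import sys


def stop_with_timeout(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask a process to exit, then force-kill it if it overstays the grace period.

    Returns "graceful" if the process exited on its own after the
    termination request, or "killed" if SIGKILL was needed.
    """
    proc.terminate()  # SIGTERM on POSIX: the process can still flush and clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: no cleanup runs, buffered writes may be lost
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # Stand-in child process (hypothetical); a real script would manage the broker.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_timeout(child, grace_seconds=5.0))
```

Logging which branch was taken would make it possible to confirm afterwards whether a given restart ended in a forced kill.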
________________________________________
From: Matthias Radestock [matthias at rabbitmq.com]
Sent: Wednesday, May 29, 2013 3:49 AM
To: Discussions about RabbitMQ
Cc: Denisenko, Mikhail (NIH/NLM/NCBI) [C]
Subject: Re: [rabbitmq-discuss] empty rabbit_serial file causes rabbitmq cluster to hang

On 28/05/13 23:42, Denisenko, Mikhail (NIH/NLM/NCBI) [C] wrote:
> We had a problem today that was caused by a wiped rabbit_serial file on
> one of the nodes. Our RabbitMQ cluster, consisting of two nodes, became
> inaccessible.
>
> There is a ticket: https://github.com/rabbitmq/rabbitmq-server/issues/17
>
> I think that RabbitMQ should be able to recover from this situation,
> perhaps by removing the invalid rabbit_serial file. What do you think?

As Emile mentioned in the ticket, rabbit cannot recover from arbitrary
filesystem corruption.

The specific issue of rabbit_serial being empty actually has come up
once before - see
http://rabbitmq.1065348.n5.nabble.com/Cannot-start-guid-generator-rabbitmq-td4415.html.
So I have filed a feature request to investigate whether we should
handle this particular case of corruption more gracefully.
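
As an illustration of the specific symptom (a zero-length rabbit_serial), a pre-start check might look like the sketch below. It is a hypothetical Python helper, not part of RabbitMQ; the function name `is_empty_file` and the example path are assumptions, and the real location of the file depends on the node's data directory. The sketch only detects the condition and deliberately takes no action on the file.

```python
from pathlib import Path


def is_empty_file(path: Path) -> bool:
    """Return True if path is an existing regular file of zero length."""
    return path.is_file() and path.stat().st_size == 0


if __name__ == "__main__":
    # Assumed location; substitute the node's actual data directory.
    serial = Path("/var/lib/rabbitmq/mnesia/rabbit@host/rabbit_serial")
    if is_empty_file(serial):
        print(f"warning: {serial} exists but is empty")
```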

I recommend you look into what may have caused the corruption. While
there is an easy workaround in this particular instance, the same isn't
true for other files, and corruptions there could easily lead to data
loss, possibly undetected.


Regards,

Matthias.

