[rabbitmq-discuss] crash in a two node RabbitMQ cluster

Mon Dec 17 10:59:02 GMT 2012

On 15/12/2012 8:22PM, Aravindh S wrote:
> Hi

Hi.

> we are running RabbitMQ v 2.8.4 in a two node cluster configuration.
>
> we had an unplanned power outage and both the servers went down. when we
> tried to restart the rabbitmq servers, only rabbit2 node starts up and
> the node rabbit1 crashes on start.
> we are running several mirrored queues between these nodes. one such
> queue "Aiken" contained more than 65K messages before the outage. Now
> rabbit1 won't start and rabbit2 starts fine but shows that there are only
> 109 old messages in the "Aiken" queue. We are afraid that we have lost the
> messages from the rabbit1 crash.

At the risk of asking something obvious: were all the messages published 
to "Aiken" published with delivery_mode=2 (persistent)? Any 
non-persistent messages will be removed from the queue after restart.
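
To make the rule concrete, here is a toy model in Python. It is not 
broker or client-library code: the message dictionaries and the 
survives_restart function are made up for illustration, and it assumes 
the queue itself is durable. It only shows which messages would be 
expected to remain after a restart.

```python
# Toy model only: these dicts stand in for messages, they are not a real client API.
PERSISTENT = 2  # AMQP delivery_mode value that marks a message as persistent


def survives_restart(messages):
    """Return the messages expected to remain after a broker restart.

    Assumes the queue is durable; anything not published with
    delivery_mode=2 is treated as transient and dropped.
    """
    return [m for m in messages if m.get("delivery_mode") == PERSISTENT]


queue = [
    {"body": "a", "delivery_mode": 2},  # persistent
    {"body": "b", "delivery_mode": 1},  # explicitly transient
    {"body": "c"},                      # no delivery_mode set: transient
]

remaining = survives_restart(queue)
print([m["body"] for m in remaining])
```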

> Rabbit1 node crashes on startup on both conditions where rabbit2 was
> down and also when rabbit2 was up.
>
> we could see the following message in the startup log,
>
> BOOT FAILED
> ===========
>
> Error description:
>
>   {badmatch,{error,{"/var/lib/rabbitmq/mnesia/rabbit@rabbit1/queues/1NGZF3JZJR0SU2C0VE2S25JRP/clean.dot",
>                       eacces}}}

"eacces" (permission denied) is the key here - for some reason the 
server is not being permitted to access the file by the operating 
system. Assuming you have 
installed via debs / RPMs, all files under /var/lib/rabbitmq/mnesia 
should be owned by the "rabbitmq" user - are they?
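
If it helps, a quick way to check is a small script along these lines. 
This is only a sketch: files_not_owned_by and report are names I made up, 
it looks at ownership only (not mode bits), and it assumes a Unix system 
where the account is called "rabbitmq".

```python
import os
import pwd


def files_not_owned_by(root, uid):
    """Return root and every path beneath it whose owner is not uid."""
    mismatched = []
    if os.lstat(root).st_uid != uid:
        mismatched.append(root)
    # os.walk silently skips directories it cannot list, so run this as a
    # user (e.g. root) that is allowed to read the whole tree.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


def report(root="/var/lib/rabbitmq/mnesia", user="rabbitmq"):
    """Print every path under root that is not owned by user."""
    uid = pwd.getpwnam(user).pw_uid  # raises KeyError if the user does not exist
    for path in files_not_owned_by(root, uid):
        print(path)
```

Calling report() on the affected machine should print nothing if the 
ownership is as expected; any path it does print is worth a closer look.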

> logs are available here:

Looking at the logs it looks like you had several attempts to start 
rabbit1 before that error message showed up, but they were stymied by a 
bug in the management plugin startup code that has been fixed since 2.8.4...

> Can anyone help me with ideas to recover rabbit1 ??
> Is there a way to tweak the startup of Rabbit1 so that it would start as
> an independent node ?

...however, even if you start rabbit1 as part of the cluster it will 
start its mirrored queues from scratch (see 
http://www.rabbitmq.com/ha.html#unsynchronised-slaves).

It's not easy to start such a node independently in 2.x I'm afraid (this 
was improved in 3.0). I wrote some rather ad-hoc instructions here: 
http://rabbitmq.1065348.n5.nabble.com/Repairing-a-a-crashed-cluster-td22466.html

But I'm afraid that if the messages were originally published in 
non-persistent mode you won't get them back - they would never even have 
made it to disc.

Cheers, Simon