[rabbitmq-discuss] crash in a two node RabbitMQ cluster
Simon MacMullen
simon at rabbitmq.com
Mon Dec 17 10:59:02 GMT 2012
On 15/12/2012 8:22PM, Aravindh S wrote:
> Hi
Hi.
> we are running RabbitMQ v 2.8.4 in a two node cluster configuration.
>
> we had an unplanned power outage and both the servers went down. when we
> tried to restart the rabbitmq servers, only rabbit2 node starts up and
> the node rabbit1 crashes on start.
> we are running several mirrored queues between these nodes.one such
> queue "Aiken" contained more than 65K messages before the outage.Now
> rabbit1 wont start and rabbit2 starts fine but shows that there are only
> 109 old messages in the "Aiken" Queue.We are afraid if we have lost the
> messages from the rabbit1 crash.
At the risk of asking something obvious: were all the messages published
to "Aiken" published with delivery_mode=2 (persistent)? Any
non-persistent messages will be removed from the queue on restart.
> Rabbit1 node crashes on startup on both conditions where rabbit2 was
> down and also when rabbit2 was up.
>
> we could see the following message in the startup log,
>
> BOOT FAILED
> ===========
>
> Error description:
>
> {badmatch,{error,{"/var/lib/rabbitmq/mnesia/rabbit@rabbit1/queues/1NGZF3JZJR0SU2C0VE2S25JRP/clean.dot",
> eacces}}}
"eacces" is the key here - for some reason the server is not being
permitted to read the file by the operating system. Assuming you have
installed via debs / RPMs, all files under /var/lib/rabbitmq/mnesia
should be owned by the "rabbitmq" user - are they?
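If it helps, here is a rough way to check the ownership. This is only a sketch: the function name find_not_owned_by is made up for illustration, and it assumes a Unix host where the "rabbitmq" user exists. It looks at file ownership only, so a file with the right owner but restrictive mode bits could still give "eacces".

```python
import os
import pwd


def find_not_owned_by(root, user):
    """Return every path under root (root included) not owned by `user`."""
    # Raises KeyError if the named user does not exist on this host.
    uid = pwd.getpwnam(user).pw_uid
    mismatched = []
    if os.lstat(root).st_uid != uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched
```

Calling find_not_owned_by("/var/lib/rabbitmq/mnesia", "rabbitmq") as root should return an empty list if the ownership is as expected; anything it does return is worth a closer look.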
> logs are available here:
From the logs, it looks like you made several attempts to start
rabbit1 before that error message showed up, but they were stymied by a
bug in the management plugin startup code that has been fixed since 2.8.4...
> Can anyone help me with ideas to recover rabbit1 ??
> Is there a way to tweak the startup of Rabbit1 so that it would start as
> an independent node ?
...however, even if you start rabbit1 as part of the cluster it will
start its mirrored queues from scratch (see
http://www.rabbitmq.com/ha.html#unsynchronised-slaves).
It's not easy to start such a node independently in 2.x I'm afraid (this
was improved in 3.0). I wrote some rather ad-hoc instructions here:
http://rabbitmq.1065348.n5.nabble.com/Repairing-a-a-crashed-cluster-td22466.html
But I'm afraid that if the messages were originally published in
non-persistent mode you won't get them back - they would never even have
made it to disc.
Cheers, Simon