[rabbitmq-discuss] Queue data recovery after master failure

Wed Jun 18 16:42:07 BST 2014

On 18/06/14 16:21, Andrei D. wrote:
> Thanks for the quick response Simon.
> I assume there's no easy workaround?

I can't think of an easy one. If I was desperate then I would try the 
following, assuming we start from a completely stopped cluster:

0) Back up Mnesia dirs on all machines, obviously.

1) Start a slave node with RABBITMQ_NODE_ONLY set. Make sure it is set, 
or the slave will start the rabbit app, which will clear out the slave's 
persistent storage, and you will have to restore from the backup in 0).

2) Run "rabbitmqctl forget_cluster_node --offline <dead-master>"

3) Start the mnesia app on the slave.

4) Update the rabbit_durable_queue records for queues that need 
recovering from this slave, moving the slave pid for the appropriate 
node() from the 'slave_pids' field to the 'pid' field.

5) Start the rabbit app on the slave.

I think that stands a decent chance of working, but obviously the 
usability of such a solution is exceptionally poor. Step 4) in 
particular would require some Erlang programming.
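
To make the record change in step 4) concrete, here is a minimal sketch 
of the logic only, modelled in Python rather than Erlang. It is not 
RabbitMQ code: the QueueRecord class, the (node, id) stand-in for an 
Erlang pid, and the promote_slave name are all hypothetical, and the 
real record has more fields than shown. It also assumes the old master 
pid is simply overwritten. The real update would have to be done in 
Erlang against the rabbit_durable_queue table in Mnesia.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# Hypothetical stand-in for an Erlang pid: (node name, local id).
# Real pids are opaque Erlang terms; this only models "which node owns it".
Pid = Tuple[str, int]


@dataclass(frozen=True)
class QueueRecord:
    """Simplified model of a durable queue record (only the fields step 4 touches)."""
    name: str
    pid: Pid                      # master pid
    slave_pids: Tuple[Pid, ...]   # pids of the mirrors


def promote_slave(record: QueueRecord, node: str) -> QueueRecord:
    """Move the slave pid living on `node` from slave_pids to pid.

    Assumption: the previous master pid is discarded, since the dead
    master has already been forgotten in step 2).
    """
    matches = [p for p in record.slave_pids if p[0] == node]
    if len(matches) != 1:
        # Refuse to guess if the node has no slave (or more than one) for this queue.
        raise ValueError(f"expected exactly one slave pid on {node}, found {len(matches)}")
    new_master = matches[0]
    remaining = tuple(p for p in record.slave_pids if p != new_master)
    return replace(record, pid=new_master, slave_pids=remaining)


if __name__ == "__main__":
    before = QueueRecord(
        name="orders",
        pid=("rabbit@dead-master", 1),
        slave_pids=(("rabbit@slave1", 7), ("rabbit@slave2", 9)),
    )
    print(promote_slave(before, "rabbit@slave1"))
```

The function returns a new record instead of mutating the old one, which 
mirrors how an Erlang version would read a record, build an updated 
copy, and write it back inside a transaction.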

> (such as manually extracting the queue
> data from the slave and copying it to the new master before it rejoins the
> cluster; I'm not familiar with the rabbit queue data storage format so I'm
> not sure if that's feasible - probably not since you haven't mentioned it ;)

I'm not sure how well that would work. You'd have the problem that you 
need not just the individual queue's index files but also the files 
containing that queue's messages from the message store. I can't see 
that being fun to sort out.

> ps: couldn't access the 26191 in bugzilla, I assume it's private to
> contributors?

Yes. But you can look out for it in future release notes, and as a 
branch in hg.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal