[rabbitmq-discuss] Recipe for corrupting mnesia in a cluster

Wed Jul 24 17:40:04 BST 2013

Ah-ha!  You are right!  Whenever I did my testing on this, I would start
one node and wait for the status to come back "OK" or "FAILED" before
starting the other.  Now if I start both at the same time, it works
splendidly!  Thank you for that.

I have a couple of followup questions, if you don't mind:

   - Is it possible to configure RabbitMQ to wait longer than 30 seconds
   before timing out?  I looked in the docs and couldn't find anything that
   seemed to address this.

   - If for some reason one of the nodes cannot be brought back online,
   would we then need to "forget" it on the other node (as described below)?
      - export RABBITMQ_NODE_ONLY=true
      - rabbitmq-server &
      - rabbitmqctl forget_cluster_node --offline rabbit@node1
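
To make those steps concrete, here is roughly what I have in mind as a script. It is only a sketch of the sequence above, not a confirmed recovery procedure. The run and forget_dead_node helpers and the DRY_RUN switch are hypothetical names of my own. I have also swapped the trailing "&" for rabbitmq-server's -detached option, which is an assumption on my part about how best to background the node. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the three steps listed above.
forget_dead_node() {
    dead_node="$1"   # the node that cannot be brought back, e.g. rabbit@node1

    # Start only the Erlang node, without the rabbit application.
    export RABBITMQ_NODE_ONLY=true

    # Assumption: -detached stands in for the trailing "&" in the steps above.
    run rabbitmq-server -detached

    # Ask the surviving node to drop the dead node from the cluster.
    run rabbitmqctl forget_cluster_node --offline "$dead_node"
}

forget_dead_node "rabbit@node1"
```

Running it with DRY_RUN=0 on the surviving node would execute the commands for real, so I would only do that once the procedure itself has been confirmed.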

Thanks again for the reply!  I feel a lot better about things now. ;-)

-Chris

On Wed, Jul 24, 2013 at 10:51 AM, Matthias Radestock
<matthias at rabbitmq.com> wrote:

> Chris,
>
>
> On 23/07/13 15:39, Chris wrote:
>
>> We are using RabbitMQ 3.1.1 / Erlang R16B on Redhat EL 6.2.  We've
>> discovered a scenario that can corrupt the RabbitMQ databases pretty
>> consistently, and are wondering if you might have some suggestions for
>> prevention (or might want to consider a fix if possible).
>>
>> In short, if you are running two nodes in a cluster, and there are
>> active connections, cutting the power to both nodes in short succession
>> can corrupt both databases.
>> [...]
>>
>>     =INFO REPORT==== 23-Jul-2013::09:44:26 ===
>>     Timeout contacting cluster nodes: ['rabbit@node2'].
>>
>
> The issue here is that the 2nd node did not come back up within 30s of the
> first. If it had, everything would have been fine.
>
> No db corruption has occurred. This is simply a case of both nodes
> thinking they weren't the last to shut down and waiting for the other to
> come up.
>
>
>> The only way I've been able to fix this is by deleting the contents of
>> mnesia on both nodes and re-clustering them.
>>
>
> Starting rabbit on both nodes inside 30 seconds should resolve the problem.
>
> Regards,
>
> Matthias.
>