[rabbitmq-discuss] Recipe for corrupting mnesia in a cluster

Tue Jul 23 15:39:16 BST 2013

Hi All,

We are running RabbitMQ 3.1.1 on Erlang R16B on Red Hat Enterprise Linux
6.2.  We've found a scenario that corrupts the RabbitMQ (Mnesia) databases
fairly consistently, and are wondering whether you have any suggestions
for preventing it (or would consider a fix, if one is possible).

In short, if you are running two nodes in a cluster, and there are active
connections, cutting the power to both nodes in quick succession can
corrupt both databases.  This is also easy to reproduce with "reboot -nf",
which reboots immediately without syncing disks or shutting down cleanly.

To reproduce:

   - Make sure both nodes are properly running in the cluster
   - Make sure there are active connections to the nodes (the problem does
   not always reproduce otherwise)
   - On Node1, execute: reboot -nf
   - Within a few seconds, on Node2, execute: reboot -nf
   - When they come back up again, you will not be able to start RabbitMQ
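
The steps above can be sketched as a small dry-run planner. This is only an illustration under assumptions: the host names `node1` and `node2`, the use of `ssh` as `root`, and the 3-second delay are placeholders, not details from our setup. The function names are made up for this sketch. It only builds and prints the commands; it does not run them.

```python
import shlex


def build_repro_plan(nodes, delay_seconds=3.0, ssh_user="root"):
    """Return the forced-reboot recipe as an ordered list of steps.

    Each step is ("run", argv) or ("sleep", seconds). Nothing is executed
    here; the caller decides whether to act on the plan.
    """
    if len(nodes) != 2:
        raise ValueError("the recipe describes exactly two clustered nodes")
    plan = []
    for index, node in enumerate(nodes):
        if index > 0:
            # "Within a few seconds": the exact delay is an assumption.
            plan.append(("sleep", delay_seconds))
        # reboot -nf: no disk sync (-n), no clean shutdown (-f).
        plan.append(("run", ["ssh", f"{ssh_user}@{node}", "reboot -nf"]))
    return plan


def format_plan(plan):
    """Render the plan as shell-style lines for review."""
    lines = []
    for kind, value in plan:
        if kind == "sleep":
            lines.append(f"sleep {value:g}")
        else:
            lines.append(" ".join(shlex.quote(part) for part in value))
    return lines


if __name__ == "__main__":
    for line in format_plan(build_repro_plan(["node1", "node2"])):
        print(line)
```

Keeping the plan as data makes it easy to review before anything destructive is run against real machines.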

If you look at the logs you will see the following errors:

=INFO REPORT==== 23-Jul-2013::09:43:55 ===
Starting RabbitMQ 3.1.1 on Erlang R16B
Copyright (C) 2007-2013 VMware, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 23-Jul-2013::09:43:55 ===
node           : rabbit@node1
home dir       : /var/lib/rabbitmq
config file(s) : (none)
cookie hash    : PRImCFlol1hmJpetFO7NUg==
log            : /var/log/rabbitmq/rabbit@node1.log
sasl log       : /var/log/rabbitmq/rabbit@node1-sasl.log
database dir   : /var/lib/rabbitmq/mnesia/rabbit@node1

=INFO REPORT==== 23-Jul-2013::09:43:56 ===
Limiting to approx 3996 file handles (3594 sockets)

=INFO REPORT==== 23-Jul-2013::09:44:26 ===
Timeout contacting cluster nodes: ['rabbit@node2'].

DIAGNOSTICS
===========

nodes in question: ['rabbit@node2']

hosts, their running nodes and ports:
- node2: []

current node details:
- node name: 'rabbit@node1'
- home dir: /var/lib/rabbitmq
- cookie hash: PRImCFlol1hmJpetFO7NUg==

=INFO REPORT==== 23-Jul-2013::09:44:27 ===
Error description:
   {could_not_start,rabbit,
       {bad_return,
           {{rabbit,start,[normal,[]]},
            {'EXIT',
                {rabbit,failure_during_boot,
                    {error,
                        {timeout_waiting_for_tables,
                            [rabbit_user,rabbit_user_permission,rabbit_vhost,
                             rabbit_durable_route,rabbit_durable_exchange,
                             rabbit_runtime_parameters,
                             rabbit_durable_queue]}}}}}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@node1.log
   /var/log/rabbitmq/rabbit@node1-sasl.log
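
The boot failure above names the Mnesia tables the node gave up waiting for. As a rough sketch, a script could pull that list out of a log to detect this failure automatically after a power loss. The function name, the regular expression, and the embedded sample are illustrative assumptions; this is not a RabbitMQ tool.

```python
import re

# Matches the Erlang term {timeout_waiting_for_tables, [tab1,tab2,...]},
# tolerating the line breaks the logger inserts inside the list.
TABLE_TIMEOUT = re.compile(r"timeout_waiting_for_tables,\s*\[([^\]]*)\]")


def tables_timed_out(log_text):
    """Return the table names from the first table-timeout error, or []."""
    match = TABLE_TIMEOUT.search(log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Excerpt of the error term from the log shown above.
SAMPLE = """
                    {error,
                        {timeout_waiting_for_tables,

[rabbit_user,rabbit_user_permission,rabbit_vhost,
                             rabbit_durable_route,rabbit_durable_exchange,
                             rabbit_runtime_parameters,
                             rabbit_durable_queue]}}}}}}}
"""

if __name__ == "__main__":
    for table in tables_timed_out(SAMPLE):
        print(table)
```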

The only way I've been able to fix this is by deleting the contents of
the mnesia database directory on both nodes and re-clustering them.  Aside
from requiring manual intervention, this of course also causes data loss.
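
For reference, the workaround could look roughly like the plan below. It is a sketch under several assumptions: node names of the form rabbit@node1, the database directory under /var/lib/rabbitmq/mnesia, the `service rabbitmq-server` init script, and moving the directory aside instead of deleting it (a swap made here so the old files stay available for inspection). The function name is hypothetical. The code only lists commands per host; it runs nothing, and following the steps still loses the broker's data.

```python
def recovery_plan(first, second, base="/var/lib/rabbitmq/mnesia"):
    """Return (host, command) pairs for wiping both nodes and re-clustering.

    Assumption: `first` is restarted as a fresh standalone node and
    `second` then joins it.
    """
    steps = []
    for host in (first, second):
        steps.append((host, "service rabbitmq-server stop"))
        # Move the database directory aside rather than deleting it.
        steps.append(
            (host, f"mv {base}/rabbit@{host} {base}/rabbit@{host}.corrupt")
        )
    steps.append((first, "service rabbitmq-server start"))
    steps.append((second, "service rabbitmq-server start"))
    # Re-cluster: the joining node's app must be stopped first.
    steps.append((second, "rabbitmqctl stop_app"))
    steps.append((second, f"rabbitmqctl join_cluster rabbit@{first}"))
    steps.append((second, "rabbitmqctl start_app"))
    return steps


if __name__ == "__main__":
    for host, command in recovery_plan("node1", "node2"):
        print(f"[{host}] {command}")
```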

I know this is a bit of an edge case, but it has already happened to some
of our customers.  I guess in the case of a power outage (with no UPS!) it
is pretty likely.  Does anyone have any thoughts on it?  Is it something
that can be resolved or is it one of those cases where there's really
nothing that can be done?

Thanks!
Chris