<div dir="ltr">Hi All,<div><br></div><div>We are using RabbitMQ 3.1.1 / Erlang R16B on Red Hat Enterprise Linux 6.2. We've discovered a scenario that corrupts the RabbitMQ databases fairly consistently, and we are wondering whether you have any suggestions for preventing it (or would consider a fix, if one is possible).</div>
<div><br></div><div>In short, if you are running two nodes in a cluster and there are active client connections, cutting the power to both nodes in quick succession can corrupt both databases. This can also be reproduced easily with "reboot -nf".</div>
<div><br></div><div>To reproduce:</div><div><div><ul><li>Make sure both nodes are running properly in the cluster</li><li>Make sure there are active connections to the nodes (the problem doesn't always reproduce otherwise)</li><li>
On Node1, execute: reboot -nf</li><li>Within a few seconds, on Node2, execute: reboot -nf</li><li>When both nodes come back up, you will not be able to start RabbitMQ on either of them</li></ul><div>If you look at the logs, you will see the following errors:</div>
</div><div><br></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><div><div><font face="courier new, monospace">=INFO REPORT==== 23-Jul-2013::09:43:55 ===</font></div></div></div><div><div><div>
<font face="courier new, monospace">Starting RabbitMQ 3.1.1 on Erlang R16B</font></div></div></div><div><div><div><font face="courier new, monospace">Copyright (C) 2007-2013 VMware, Inc.</font></div></div></div><div><div>
<div><font face="courier new, monospace">Licensed under the MPL. See <a href="http://www.rabbitmq.com/">http://www.rabbitmq.com/</a></font></div></div></div><div><div><div><font face="courier new, monospace"><br></font></div>
</div></div><div><div><div><font face="courier new, monospace">=INFO REPORT==== 23-Jul-2013::09:43:55 ===</font></div></div></div><div><div><div><font face="courier new, monospace">node : rabbit@node1</font></div>
</div></div><div><div><div><font face="courier new, monospace">home dir : /var/lib/rabbitmq</font></div></div></div><div><div><div><font face="courier new, monospace">config file(s) : (none)</font></div></div></div>
<div><div><div><font face="courier new, monospace">cookie hash : PRImCFlol1hmJpetFO7NUg==</font></div></div></div><div><div><div><font face="courier new, monospace">log : /var/log/rabbitmq/rabbit@node1.log</font></div>
</div></div><div><div><div><font face="courier new, monospace">sasl log : /var/log/rabbitmq/rabbit@node1-sasl.log</font></div></div></div><div><div><div><font face="courier new, monospace">database dir : /var/lib/rabbitmq/mnesia/rabbit@node1</font></div>
</div></div><div><div><div><font face="courier new, monospace"><br></font></div></div></div><div><div><div><font face="courier new, monospace">=INFO REPORT==== 23-Jul-2013::09:43:56 ===</font></div></div></div><div><div><div>
<font face="courier new, monospace">Limiting to approx 3996 file handles (3594 sockets)</font></div></div></div><div><div><div><font face="courier new, monospace"><br></font></div></div></div><div><div><div><font face="courier new, monospace">=INFO REPORT==== 23-Jul-2013::09:44:26 ===</font></div>
</div></div><div><div><div><font face="courier new, monospace">Timeout contacting cluster nodes: ['rabbit@node2'].</font></div></div></div><div><div><div><font face="courier new, monospace"><br></font></div></div>
</div><div><div><div><font face="courier new, monospace">DIAGNOSTICS</font></div></div></div><div><div><div><font face="courier new, monospace">===========</font></div></div></div><div><div><div><font face="courier new, monospace"><br>
</font></div></div></div><div><div><div><font face="courier new, monospace">nodes in question: ['rabbit@node2']</font></div></div></div><div><div><div><font face="courier new, monospace"><br></font></div></div></div>
<div><div><div><font face="courier new, monospace">hosts, their running nodes and ports:</font></div></div></div><div><div><div><font face="courier new, monospace">- node2: []</font></div></div></div><div><div><div><font face="courier new, monospace"><br>
</font></div></div></div><div><div><div><font face="courier new, monospace">current node details:</font></div></div></div><div><div><div><font face="courier new, monospace">- node name: 'rabbit@node1'</font></div>
</div></div><div><div><div><font face="courier new, monospace">- home dir: /var/lib/rabbitmq</font></div></div></div><div><div><div><font face="courier new, monospace">- cookie hash: PRImCFlol1hmJpetFO7NUg==</font></div></div>
</div><div><div><div><font face="courier new, monospace"><br></font></div></div></div><div><div><div><font face="courier new, monospace"><br></font></div></div></div><div><div><div><font face="courier new, monospace"><br>
</font></div></div></div><div><div><div><font face="courier new, monospace">=INFO REPORT==== 23-Jul-2013::09:44:27 ===</font></div></div></div><div><div><div><font face="courier new, monospace">Error description:</font></div>
</div></div><div><div><div><font face="courier new, monospace"> {could_not_start,rabbit,</font></div></div></div><div><div><div><font face="courier new, monospace"> {bad_return,</font></div></div></div><div><div><div>
<font face="courier new, monospace"> {{rabbit,start,[normal,[]]},</font></div></div></div><div><div><div><font face="courier new, monospace"> {'EXIT',</font></div></div></div><div><div><div><font face="courier new, monospace"> {rabbit,failure_during_boot,</font></div>
</div></div><div><div><div><font face="courier new, monospace"> {error,</font></div></div></div><div><div><div><font face="courier new, monospace"> {timeout_waiting_for_tables,</font></div>
</div></div><div><div><div><font face="courier new, monospace"> [rabbit_user,rabbit_user_permission,rabbit_vhost,</font></div></div></div><div><div><div><font face="courier new, monospace"> rabbit_durable_route,rabbit_durable_exchange,</font></div>
</div></div><div><div><div><font face="courier new, monospace"> rabbit_runtime_parameters,</font></div></div></div><div><div><div><font face="courier new, monospace"> rabbit_durable_queue]}}}}}}}</font></div>
</div></div><div><div><div><font face="courier new, monospace"><br></font></div></div></div><div><div><div><font face="courier new, monospace">Log files (may contain more information):</font></div></div></div><div><div><div>
<font face="courier new, monospace"> /var/log/rabbitmq/rabbit@node1.log</font></div></div></div><div><div><div><font face="courier new, monospace"> /var/log/rabbitmq/rabbit@node1-sasl.log</font></div></div></div></blockquote>
<div><div><br></div><div>The only way I've been able to fix this is by deleting the contents of the mnesia database directory on both nodes and re-clustering them. Aside from requiring manual intervention, this of course also causes data loss.</div>
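<div><br></div><div>For reference, the manual recovery described above can be sketched roughly as follows. This is only a sketch under some assumptions: the init script is called rabbitmq-server and is driven with "service" (not stated anywhere above), the data lives under the parent of the "database dir" shown in the log, and node2 is the node that rejoins node1. The helper names (run, wipe_node, join_node) and the RUN and MNESIA_DIR variables are made up for illustration. It only prints the commands unless RUN=1 is set, because the wipe step is destructive.</div>
<pre>
```shell
#!/bin/sh
# Sketch of the manual recovery: wipe the mnesia data, then re-cluster.
# DESTRUCTIVE when RUN=1: all definitions and durable messages are lost.
# Assumptions (not from the report): the "service rabbitmq-server" init
# script, and node2 rejoining node1.

# Parent of the "database dir" shown in the startup log.
MNESIA_DIR="${MNESIA_DIR:-/var/lib/rabbitmq/mnesia}"

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Run on every node: stop the broker, empty the database dir, start again.
wipe_node() {
    run service rabbitmq-server stop
    run find "$MNESIA_DIR" -mindepth 1 -delete
    run service rabbitmq-server start
}

# Run on the second node only. $1 is the node to join, e.g. rabbit@node1.
join_node() {
    run rabbitmqctl stop_app
    run rabbitmqctl join_cluster "$1"
    run rabbitmqctl start_app
}
```
</pre>
<div>With the functions loaded, the idea would be to call wipe_node on node1, then wipe_node followed by join_node rabbit@node1 on node2. Users, vhosts, permissions, exchanges and queues all have to be recreated afterwards, which is exactly the data loss mentioned above.</div>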
<div><br></div><div>I know this is a bit of an edge case, but it has already happened to some of our customers. I guess in the case of a power outage (with no UPS!) it is pretty likely. Does anyone have any thoughts on it? Is it something that can be resolved or is it one of those cases where there's really nothing that can be done?</div>
<div><br></div><div>Thanks!</div><div>Chris</div><br></div></div>