<div dir="ltr">My two-node cluster in production is breaking with these error messages:<div><br><div><div><font color="#ff6666">=ERROR REPORT==== 23-Dec-2011::04:21:34 ===</font></div><div><font color="#ff6666">** Node rabbit@rabbitmq02 not responding **</font></div>
<div><font color="#ff6666">** Removing (timedout) connection **</font></div><div><font color="#ff6666"><br></font></div><div><font color="#ff6666">=INFO REPORT==== 23-Dec-2011::04:21:35 ===</font></div><div><font color="#ff6666">node rabbit@rabbitmq02 lost 'rabbit'</font></div>
<div><font color="#ff6666"><br></font></div><div><font color="#ff6666">=ERROR REPORT==== 23-Dec-2011::04:21:49 ===</font></div><div><font color="#ff6666">Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}</font></div>
</div></div><div><br></div><div><br></div><div>I tried to simulate the problem by killing the connection between the two nodes with "tcpkill".</div><div>The cluster disconnected and, surprisingly, the two nodes do not try to reconnect!</div>
<div><br></div><div>When the cluster breaks, the haproxy load balancer still marks both nodes as active and sends requests to both of them,</div><div>although they are no longer in a cluster.</div><div><br></div><div>My questions:</div>
<div><br></div><div>1. If the nodes are configured to work as a cluster, why don't they try to reconnect after a network failure?</div><div><br></div><div>2. How can I detect a broken cluster and automatically shut down one of the nodes?</div>
<div>(I have consistency problems when the two nodes work separately.)</div><div><br></div><div><br></div><div>Urgent, please help!</div></div>