[rabbitmq-discuss] RabbitMQ Cluster - Please Help !
matthew at rabbitmq.com
Wed Dec 28 10:56:55 GMT 2011
On Tue, Dec 27, 2011 at 04:20:36PM +0200, ran mizrachi wrote:
> My two nodes cluster in production are breaking with these error messages:
> =ERROR REPORT==== 23-Dec-2011::04:21:34 ===
> ** Node rabbit@rabbitmq02 not responding **
> ** Removing (timedout) connection **
> =INFO REPORT==== 23-Dec-2011::04:21:35 ===
> node rabbit@rabbitmq02 lost 'rabbit'
Ok, you might like to experiment with increasing the net tick time, which
is the period after which Erlang assumes an unresponsive node is down
(the default is 60 seconds). Something like
SERVER_START_ARGS="-kernel net_ticktime 500" in your rabbitmq-env.conf
file might help.
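To put rough numbers on what that setting changes: as far as the Erlang
kernel documentation describes it, connected nodes are ticked every
net_ticktime/4 seconds, and a silent peer is declared down somewhere
between 0.75 and 1.25 times net_ticktime. A small Python sketch of that
arithmetic follows; the function name is made up for illustration, and 60
is the Erlang default rather than anything taken from the poster's setup.

```python
def detection_window(net_ticktime):
    """Return (tick_interval, min_detect, max_detect) in seconds.

    Assumes the documented Erlang behaviour: one tick every
    net_ticktime / 4 seconds, peer declared down after between
    net_ticktime - tick and net_ticktime + tick seconds of silence.
    """
    tick_interval = net_ticktime / 4
    return (tick_interval,
            net_ticktime - tick_interval,
            net_ticktime + tick_interval)


# Compare the default with the value suggested above.
for ticktime in (60, 500):
    interval, earliest, latest = detection_window(ticktime)
    print(f"net_ticktime={ticktime}: tick every {interval:g}s, "
          f"peer declared down after {earliest:g}s to {latest:g}s of silence")
```

The trade-off is that with 500 a node that really has died can go
unnoticed for roughly six to ten minutes, so a large value hides short
network glitches at the cost of slow failure detection.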
> I tried to simulate the problem by killing the connection between the two
> nodes using "tcpkill",
> the cluster has disconnected, and surprisingly the two nodes are not trying
> to reconnect !
That's not surprising. Rabbit does not try to heal clusters - at least
The fact that the cluster is breaking up suggests something else is
wrong - a networking issue or some such which is causing packet loss.
> 1. If the nodes are configured to work as a cluster, when I get a network
> failure , why aren't they trying to reconnect after ?
In CAP terms, Rabbit does C and A, not P. Thus a failure of a node is
permanent without manual intervention.
> 2. How can I identify broken cluster and automatic shutdown one of the
> nodes ?
Ahh well, assume you have a cluster of N nodes, and a network partition
splits them all apart, and now you have N clusters, each of 1 node.
Which cluster is the "right" cluster? In general, this is a very hard
problem - it's not always right to just take the biggest remaining
cluster, even if there is one. It's often very application specific too.
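To illustrate why this is hard, here is a minimal Python sketch of one
common rule, strict majority. This is not something Rabbit does for you,
and the function name and the rabbit@rabbitmq01 node name are assumptions
made up for the example. Note what happens with a two-node cluster like
the one in question: neither half can ever claim to be the "right" one.

```python
def side_may_continue(visible_nodes, cluster_size):
    """True only if this side can see a strict majority of the cluster.

    visible_nodes: set of node names reachable from here, including self.
    cluster_size:  number of nodes the cluster was configured with.
    """
    return len(visible_nodes) > cluster_size / 2


# Two-node cluster split down the middle: no side has a majority.
print(side_may_continue({"rabbit@rabbitmq01"}, 2))
print(side_may_continue({"rabbit@rabbitmq02"}, 2))

# Five nodes split 3/2: only the three-node side qualifies.
print(side_may_continue({"a", "b", "c"}, 5))
print(side_may_continue({"d", "e"}, 5))
```

Even where a majority exists, it may be the wrong side to keep, for
example if all your publishers happen to sit with the minority.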
You'd probably be best off looking at things like Pacemaker for this
stuff, seeing as it's really built to help solve these sorts of problems
and supports STONITH and other such devices.
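For the "identify a broken cluster" half of the question, a monitoring
script could compare the configured nodes with the running ones. The
sketch below is Python and only parses text; SAMPLE is an assumed example
of what "rabbitmqctl cluster_status" prints on a 2.x broker (the
rabbit@rabbitmq01 name is invented), so check it against the output of
your own version before relying on it. What to do once a node is found
missing is the hard, application-specific part and is left out.

```python
import re

# Assumed shape of `rabbitmqctl cluster_status` output; verify locally.
SAMPLE = """Cluster status of node rabbit@rabbitmq01 ...
[{nodes,[{disc,[rabbit@rabbitmq01,rabbit@rabbitmq02]}]},
 {running_nodes,[rabbit@rabbitmq01]}]
...done.
"""

# Erlang node names look like name@host.
NODE_RE = re.compile(r"[\w.-]+@[\w.-]+")


def parse_cluster_status(text):
    """Return (configured, running) sets of node names."""
    head, sep, running_part = text.partition("{running_nodes,")
    start = head.find("{nodes,")
    if not sep or start < 0:
        raise ValueError("unrecognised cluster_status output")
    # Skip the banner line, which also mentions a node name.
    configured = set(NODE_RE.findall(head[start:]))
    running = set(NODE_RE.findall(running_part))
    return configured, running


def missing_nodes(text):
    """Nodes that are configured in the cluster but not running."""
    configured, running = parse_cluster_status(text)
    return configured - running


print(sorted(missing_nodes(SAMPLE)))
```

In practice the text would come from running "rabbitmqctl cluster_status"
on each node, and a non-empty result would raise an alert or hand the
decision to something like Pacemaker.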