[rabbitmq-discuss] problem with an HA pair of rabbitmq servers

Tue Mar 2 12:19:29 GMT 2010

Hi Allan,

On Sun, Feb 28, 2010 at 06:41:19PM -0800, allan bailey wrote:
> We have a pair of rabbitmq servers. The 1st server periodically does a lot
> of intense I/O copying data out to the 2nd server. This apparently causes
> timeouts that then cause a partitioning of the cluster.

Very interesting problem - this isn't something we've come across.

I think the fix is to increase Erlang's net_ticktime. The default is
60 seconds, and the diff below raises it to 500 seconds. The trade-off
is that with a larger value the cluster takes longer to notice when a
node has genuinely failed. Please do let us know how you get on.

Applying the patch below means building from source:

--- a/src/rabbit_node_monitor.erl       Sun Feb 28 10:25:51 2010 +0000
+++ b/src/rabbit_node_monitor.erl       Tue Mar 02 12:15:45 2010 +0000
@@ -48,6 +48,7 @@
 %%--------------------------------------------------------------------
 
 init([]) ->
+    net_kernel:set_net_ticktime(500),
     ok = net_kernel:monitor_nodes(true),
     {ok, no_state}.
 
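To confirm that a patched node has picked up the new value, you can ask
net_kernel from an Erlang shell on that node. This is only a sketch using
the standard kernel calls; how you attach a shell to the running broker
depends on your setup. Nodes that talk to each other are expected to use
the same tick time, so it is worth checking on both servers.

```erlang
%% Evaluate in an Erlang shell on each cluster node.
%% Returns the tick time in seconds, {ongoing_change_from, Old} while a
%% change is still in its transition period, or ignored if the node is
%% not running as a distributed node.
net_kernel:get_net_ticktime().

%% Rough failure-detection window for a tick time T: an unresponsive
%% node is noticed after somewhere between 0.75*T and 1.25*T seconds.
T = 500,
{0.75 * T, 1.25 * T}.
```

The same setting is also available as the kernel application parameter
net_ticktime (for example "-kernel net_ticktime 500" on the erl command
line). That may avoid a rebuild, but whether the start scripts in your
RabbitMQ version let you pass extra VM arguments is an assumption to
verify.
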
Matthew