[rabbitmq-discuss] problem with an HA pair of rabbitmq servers

Matthew Sackman matthew at lshift.net
Tue Mar 2 12:19:29 GMT 2010


Hi Allan,

On Sun, Feb 28, 2010 at 06:41:19PM -0800, allan bailey wrote:
> We have a pair of rabbitmq servers. The 1st server periodically does a
> lot of intense I/O copying data out to the 2nd server. This apparently
> causes timeouts that then cause a partitioning of the cluster.

Very interesting problem; this isn't something we've come across before.

I think the fix is to change the Erlang net ticktime (net_ticktime). The
default is apparently 60 seconds, and the diff below changes it to 500
seconds. The trade-off is that with a larger value, the cluster will take
longer to notice when a node really has failed. Please do let us know how
you get on.

Applying the diff will, of course, require building from source.

--- a/src/rabbit_node_monitor.erl       Sun Feb 28 10:25:51 2010 +0000
+++ b/src/rabbit_node_monitor.erl       Tue Mar 02 12:15:45 2010 +0000
@@ -48,6 +48,7 @@
 %%--------------------------------------------------------------------
 
 init([]) ->
+    net_kernel:set_net_ticktime(500),
     ok = net_kernel:monitor_nodes(true),
     {ok, no_state}.
 
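To put rough numbers on that trade-off: according to the Erlang kernel
documentation, connected nodes are ticked every TickTime/4 seconds, and a
peer from which nothing has been received for four tick intervals is
considered down. Because that check only happens once per tick interval,
the time T before an unresponsive node is declared down falls in a window
around TickTime:

```latex
\begin{aligned}
T_{\min} &= \mathit{TickTime} - \tfrac{\mathit{TickTime}}{4},
  \qquad
  T_{\max} = \mathit{TickTime} + \tfrac{\mathit{TickTime}}{4} \\
\mathit{TickTime} = 60 &: \quad 45\,\text{s} < T < 75\,\text{s} \\
\mathit{TickTime} = 500 &: \quad 375\,\text{s} < T < 625\,\text{s}
\end{aligned}
```

So moving from 60 to 500 seconds stretches the worst case from about a
minute and a quarter to a little over ten minutes. The kernel documentation
also asks that all communicating nodes use the same TickTime, so both
servers in the pair would need to run the patched build.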
Matthew