<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Thomas,<div><br><div><div>On 25 Jun 2013, at 07:28, thomas wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>I have set up 3 cluster nodes namely rabbit@A, rabbit@B, rabbit@C all running<br>on erlang 16B and rabbitmq 3.1.1. I set net_ticktime to 2 so as to detect<br>node failure faster.<br><br></div></blockquote><div><br></div><div>That's a bit excessive I think. Let me quote Erlang's net_kernel man page for a moment:</div><div><br></div><div><quote></div><div><dt><strong><span class="code">net_ticktime = TickTime</span></strong></dt>
<dd>
<a name="net_ticktime"></a><p>Specifies the <span class="code">net_kernel</span> tick time. <span class="code">TickTime</span>
is given in seconds. Once every <span class="code">TickTime/4</span> second, all
connected nodes are ticked (if anything else has been written
to a node) and if nothing has been received from another node
within the last four (4) tick times that node is considered
to be down. This ensures that nodes which are not responding,
for reasons such as hardware errors, are considered to be
down.</p><p>The time <span class="code">T</span>, in which a node that is not responding is
detected, is calculated as: <span class="code">MinT < T < MaxT</span> where:</p>
<div class="example"><pre>MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4</pre></div><p><span class="code">TickTime</span> is by default 60 (seconds). Thus,
<span class="code">45 < T < 75</span> seconds.</p><p><strong>Note:</strong> All communicating nodes should have the same
<span class="code">TickTime</span> value specified.</p><p><strong>Note:</strong> Normally, a terminating node is detected
immediately.</p></dd></div></quote></div><div><br></div><div>So you're requiring the nodes to tick one another every 2 / 4 seconds, i.e., every 500 milliseconds, and by the formula above a silent node is declared down after only 1.5 to 2.5 seconds. You're also sending 200k messages to a node and expecting HA to distribute those messages across all your nodes, which happens over the same distribution channel as those tick messages. So I suspect you're not doing yourself any favours here. I'd suggest that *if you must* change net_ticktime (and personally I would leave it alone if I were you) then you should set it to maybe 45 seconds, but not to 2 seconds - that's almost guaranteed to result in weird behaviour.</div><div><br><blockquote type="cite"><div>2)For my 2nd test, it is almost identical to the 1st test except that i am<br>using mirroring using the following command:<br><br>rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'"<br><br>The observation is different from that of the 1st test. Shortly after I cut<br>off the network connection from rabbit@B, rabbit@A which is the master<br>handling the client's messages comes to a pause for over 15 seconds and the<br>pause is consistent for 10 tries. The client gets stuck in basic publish<br>when rabbit@A comes to a pause.<br><br></div></blockquote><blockquote type="cite"><div><br><br>I am quite puzzled about this behavior and would like to find out if that is<br>the intended behavior for rabbit's mirroring feature? Does anyone else<br>encounter such behavior when using mirroring? <br><br></div></blockquote><div><br></div><div>What version of rabbit are you using? Do you have automatic cluster partition handling (e.g., autoheal) set up, and if so, which mode are you using? Some delay (which blocks publishers temporarily) is possible during failover, but if you've got autoheal set up, then nodes can also be restarted, and waiting for those restarts adds further delay. Remember that a cluster partition is a serious problem, which rabbitmq clusters are /not/ tolerant of. 
If you're expecting partitions, you should consider using the federation or shovel plugins instead. Automatic cluster partition recovery is there to help, but isn't a panacea.</div><div><br></div><div>Cheers,</div><div>Tim</div><div><br></div></div></div></body></html>