[rabbitmq-discuss] RabbitMQ's Mirroring results in weird behavior when slave goes down

Tue Jun 25 11:48:51 BST 2013

Thomas,

On 25 Jun 2013, at 07:28, thomas wrote:

> I have set up 3 cluster nodes namely rabbit@A, rabbit@B, rabbit@C all running
> on erlang 16B and rabbitmq 3.1.1. I set net_ticktime to 2 so as to detect
> node failure faster.
> 

That's a bit excessive I think. Let me quote Erlang's net_kernel man page for a moment:

<quote>
net_ticktime = TickTime
Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are ticked (if anything else has been written to a node) and if nothing has been received from another node within the last four (4) tick times that node is considered to be down. This ensures that nodes which are not responding, for reasons such as hardware errors, are considered to be down.

The time T, in which a node that is not responding is detected, is calculated as: MinT < T < MaxT where:

MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4
TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.

Note: All communicating nodes should have the same TickTime value specified.

Note: Normally, a terminating node is detected immediately.

</quote>

So, you're requiring nodes to tick one another every 2 / 4 seconds, i.e., every 500 milliseconds. You're also sending 200k messages to a node and expecting HA to distribute those messages across all your nodes, which happens over the same distribution channel as those tick messages. So I suspect you're not doing yourself any favours here. I'd suggest that *if you must* change net_ticktime (and personally I would leave it alone if I were you) then you should set it to maybe 45 seconds, but not to 2 seconds. A value that low is almost guaranteed to end up in weird behaviour.
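
To put numbers on that, here is a small sketch that applies the formulas from the man page quote above (tick every TickTime / 4, MinT = TickTime - TickTime / 4, MaxT = TickTime + TickTime / 4). It is plain Python used purely for the arithmetic, and the function name detection_window is made up for illustration; it is not part of Erlang or RabbitMQ.

```python
def detection_window(tick_time):
    """Return (tick_interval, min_t, max_t) in seconds for a given net_ticktime.

    Uses only the formulas quoted from the net_kernel documentation:
      tick interval = TickTime / 4
      MinT = TickTime - TickTime / 4
      MaxT = TickTime + TickTime / 4
    """
    quarter = tick_time / 4
    return quarter, tick_time - quarter, tick_time + quarter


if __name__ == "__main__":
    # 60 is the documented default; 45 and 2 are the values discussed in this thread.
    for tick_time in (60, 45, 2):
        interval, min_t, max_t = detection_window(tick_time)
        print(
            f"net_ticktime={tick_time}: tick every {interval}s, "
            f"unresponsive node detected after {min_t}s to {max_t}s"
        )
```

With net_ticktime set to 2, the detection window shrinks to between 1.5 and 2.5 seconds, so a node that is merely busy replicating messages for a couple of seconds can be declared down.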

> 2) My 2nd test is almost identical to the 1st test except that I am
> using mirroring, set up with the following command:
> 
> rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
> 
> The observation is different from that of the 1st test. Shortly after I cut
> off the network connection from rabbit@B, rabbit@A which is the master
> handling the client's messages comes to a pause for over 15 seconds and the
> pause is consistent for 10 tries. The client gets stuck in basic publish
> when rabbit@A comes to a pause.
> 
> I am quite puzzled about this behavior and would like to find out if that is
> the intended behavior for rabbit's mirroring feature? Does anyone else
> encounter such behavior when using mirroring? 
> 

What version of rabbit are you using? Do you have a cluster auto-recovery (i.e., autoheal) set up, and if so, which mode are you using? Some delay (which temporarily blocks publishers) is possible during failover, and if you've got autoheal set up, nodes can also be restarted, in which case publishers will wait further while those nodes come back up. Remember that a cluster partition is a serious problem, which rabbitmq clusters are /not/ tolerant of. If you're expecting partitions, you should consider using the federation or shovel plugins instead. Automatic cluster partition recovery is there to help, but isn't a panacea.

Cheers,
Tim
