[rabbitmq-discuss] RabbitMQ's Mirroring results in weird behavior when slave goes down

Wed Jul 3 10:46:34 BST 2013

Hi Thomas,

On 27 Jun 2013, at 02:18, thomas wrote:

> I have tried using ignore instead of autoheal for cluster_partition_handling
> but the pause still persists, although it begins approximately 20 seconds
> after bringing down the mirror node.
> 

That won't help you, I'm afraid.

> Please kindly let me know if you experience the same problem, as I can't be
> sure whether there could be a problem on my side, which I doubt. Thanks.
> 

There's nothing wrong with your sending code. The issue appears to be that, during normal flow control, the channel expects to receive credit not just from the master queue process but from all the slaves too. When a slave dies, this can cause a delay of up to (and even exceeding) net_ticktime, basically until the node notices that its peer has gone away. Introducing a delay into the publishing rate stops the pause because, when publishing slows down, we don't hit flow control, so the logic that waits for the slaves never kicks in.
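
To make the mechanism concrete, here is a toy model in Python. It is only a sketch and not the broker's actual code: the function name `simulate`, the one-second time step, and the credit and rate figures are all assumptions chosen for illustration. The channel holds credit for both the master and the mirror, the live master keeps replenishing its credit, and the dead mirror never does, so a fast publisher stalls until the node notices the peer has gone.

```python
def simulate(publish_rate, initial_credit, detect_after, duration):
    """Return how many whole seconds the publisher spends fully blocked.

    publish_rate   -- messages the publisher tries to send per second
    initial_credit -- credit the channel holds for each queue process
    detect_after   -- second at which the node notices the dead mirror
    duration       -- total number of simulated seconds
    All values are hypothetical; this is a toy model, not RabbitMQ code.
    """
    credit = {"master": initial_credit, "mirror": initial_credit}
    blocked_seconds = 0
    for second in range(duration):
        if second == detect_after:
            # The node finally notices the peer is gone and stops waiting on it.
            credit.pop("mirror", None)
        sent = 0
        for _ in range(publish_rate):
            # The channel is blocked as soon as any peer's credit runs out.
            if any(c <= 0 for c in credit.values()):
                break
            for name in credit:
                credit[name] -= 1
            sent += 1
        # The live master grants fresh credit; the dead mirror never does.
        credit["master"] = initial_credit
        if sent == 0:
            blocked_seconds += 1
    return blocked_seconds


fast = simulate(publish_rate=100, initial_credit=200, detect_after=60, duration=90)
slow = simulate(publish_rate=2, initial_credit=200, detect_after=60, duration=90)
print(fast, slow)  # prints "58 0"
```

With these made-up numbers the fast publisher burns through the mirror's credit in two seconds and then sits blocked until the death is detected, while the slow publisher never exhausts the credit and so never pauses.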

Lowering net_ticktime can alleviate this to some extent, but it will not solve the problem consistently, as there is always going to be a possibility of delay (depending on the current load, network characteristics, etc.). You should also bear in mind that lowering net_ticktime too much can be quite dangerous if the network is already under considerable load, since ticks might get lost behind other traffic, leading to false (unwanted) net-splits.
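
As a rough guide to what a given setting buys you, the snippet below works out the detection window. It assumes the behaviour described in the Erlang kernel documentation, where an unresponsive node is detected somewhere between 0.75 and 1.25 times net_ticktime (default 60 seconds); the helper name `detection_window` is made up for this example.

```python
def detection_window(net_ticktime=60):
    """Return (earliest, latest) seconds before a dead node is noticed.

    Assumes the 0.75x to 1.25x bounds given in the Erlang kernel docs.
    """
    return 0.75 * net_ticktime, 1.25 * net_ticktime


for ticktime in (60, 20, 10):
    earliest, latest = detection_window(ticktime)
    print(f"net_ticktime={ticktime}: detected after {earliest} to {latest} seconds")
```

Even at a net_ticktime of 10 the pause could still last over ten seconds in the worst case, which is why lowering it only reduces the delay rather than removing it.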

I have filed a bug to try and improve this behaviour (in the face of slave deaths) and will try to get the fix into the next patch release. Thanks for reporting this behaviour, and for sticking with the conversation while we worked through the possible causes.

Cheers,
Tim