[rabbitmq-discuss] Pause minority cluster with publisher confirms losing messages

Michael Klishin mklishin at gopivotal.com
Wed Jun 4 09:17:38 BST 2014



On 4 June 2014 at 11:58:41, Miguel Araujo Pérez (miguel.araujo.perez at gmail.com) wrote:
> > The issue is that sometimes after a while publisher3 resumes  
> and continues pushing messages and according to the library  
> receiving acks for them, that goes for a period of 6-8 seconds  
> until an exception is raised because connection is closed (node3  
> stops Rabbit). Those "acked messages" aren't however in the  
> queue when I consume it later to see what's inside. However, other  
> times it works as i would expect and doesn't enqueue any other  
> message after iptables takes place.
>  
> So I thought this could be a library issue, and ported the code  
> to PHP using official php-amqplib and exact same thing happens.  
> My theory is that sometimes node3 after trying to coordinate  
> with other 2 nodes goes into a partition for some seconds, in those  
> seconds it confirms messages and then pause minority cluster  
> policy kicks in and stops Rabbit.

Yes, it takes time for both RabbitMQ and client libraries to detect a
connection failure. This is in part due to how TCP works. You can configure
the inactivity interval (net tick time) after which RabbitMQ nodes consider a peer down:

https://www.rabbitmq.com/nettick.html

and use a low heartbeat interval (say, 1-3 seconds) in client libraries.
This should cause the exception to be thrown much earlier (provided that your client
supports heartbeats; Pika should), at the cost of increased network traffic:

http://www.rabbitmq.com/reliability.html
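
For illustration only, here is a minimal sketch of a low heartbeat combined with
publisher confirms in Pika. It assumes a Pika 1.x release, where the connection
parameter is named `heartbeat` (older releases used a different name), and a
broker reachable on localhost. The function name and the 2-second value are
hypothetical choices, not something taken from this thread.

```python
HEARTBEAT_SECONDS = 2  # assumed value inside the 1-3 second range suggested above


def open_confirming_channel(host="localhost", heartbeat=HEARTBEAT_SECONDS):
    """Open a blocking connection with a low heartbeat and confirms enabled.

    Hypothetical helper: the name, the default host and the default
    heartbeat are assumptions made for this sketch.
    """
    # Third-party dependency, imported lazily so that this module can be
    # loaded on a machine where Pika is not installed.
    import pika

    params = pika.ConnectionParameters(host=host, heartbeat=heartbeat)
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    # Put the channel into confirm mode. With BlockingConnection, a later
    # basic_publish raises an exception if the broker nacks or returns
    # the message.
    channel.confirm_delivery()
    return connection, channel
```

A low heartbeat only shortens the window in which a dead connection goes
unnoticed; it does not by itself protect messages confirmed by a node that is
about to pause.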

Beyond that, your apps can republish the last N messages (redundantly) after a network
failure. If your consumers can de-duplicate them (e.g. every message has an id you can set),
that should work well.
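
A rough sketch of that idea, independent of any particular client library. The
class names, the `send` callable and the use of UUIDs as message ids are all
assumptions made for illustration; in a real application `send` would wrap
your client's publish call and the id would travel in the message properties.

```python
import uuid
from collections import OrderedDict, deque


class ReplayBuffer:
    """Publisher side: remember the last N messages so they can be re-sent."""

    def __init__(self, size):
        self._recent = deque(maxlen=size)  # oldest entries fall off automatically

    def publish(self, send, body):
        message_id = str(uuid.uuid4())
        # Record the message before sending, so that a message whose send
        # fails is still part of the next replay.
        self._recent.append((message_id, body))
        send(message_id, body)
        return message_id

    def replay(self, send):
        """Re-send everything in the buffer, reusing the original ids."""
        for message_id, body in list(self._recent):
            send(message_id, body)


class Deduplicator:
    """Consumer side: remember recently seen ids, bounded in size."""

    def __init__(self, capacity):
        self._seen = OrderedDict()
        self._capacity = capacity

    def is_new(self, message_id):
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

After reconnecting, the publisher would call `replay` with the new channel's
publish function, and the consumer would process a message only when `is_new`
returns True for its id. The consumer's capacity needs to be at least as large
as the publisher's N, otherwise a replayed id may already have been evicted.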

If that's not the case, there is a trick that some companies use: they run a RabbitMQ
node local to the publishing machine (which at least greatly reduces the probability of
RabbitMQ becoming unreachable), publish to that local node with publisher confirms and a
low heartbeat interval, and use Federation [1] or Shovel [2] to connect it to other nodes.

By the way, there are only two official clients: Java and .NET.

1. http://www.rabbitmq.com/federation.html
2. http://www.rabbitmq.com/shovel.html
--  
MK  

Software Engineer, Pivotal/RabbitMQ

