[rabbitmq-discuss] Pause minority cluster with publisher confirms losing messages

Michael Klishin mklishin at gopivotal.com
Wed Jun 4 11:22:24 BST 2014


On 4 June 2014 at 14:04:49, Miguel Araujo Pérez (miguel.araujo.perez at gmail.com) wrote:
> > It's my understanding that the node should do something like,  
> I cannot see nodes 1 and 2 (connection is broken), I'm by myself  
> here so I cannot confirm your publishes. Then says I've got to  
> stop, because I'm in minority. However, the fact that is confirming  
> messages for a small lapse of time feels like something is not  
> completely working. Also this actually doesn't always happens,  
> sometimes it does it right, so it's not consistent.

While I'm not very familiar with how the pause process works, there is an inherent race
condition between a node's decision to pause and the confirmation of incoming messages.

Once a node decides to pause, there may be messages "in flight" that have already been
read from the socket and parsed, and are being delivered to queues. These processes
(in both the general and the Erlang sense) can run in parallel on machines with more
than one core.

I'm not sure there is a one-size-fits-all solution on the server end. Try publishing
messages in batches and waiting for confirms for a whole batch (rather than for each
single message). You'll then have to retry in batches, too, which means that if part of
an earlier batch was lost due to the race condition explained above, it will be
published again.

Batching is a recommended practice with publisher confirms anyway.
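
A minimal sketch of the batch-and-retry idea, in Python. It is not tied to any
particular client library: publish_batch is a hypothetical callable (not a RabbitMQ
API) that is assumed to publish every message in a batch, wait for the broker's
confirms, and return True only if the whole batch was confirmed. The names
publish_in_batches, batch_size and max_attempts are likewise made up for illustration.

```python
from typing import Callable, List, Sequence


def publish_in_batches(
    messages: Sequence[bytes],
    publish_batch: Callable[[Sequence[bytes]], bool],
    batch_size: int = 100,
    max_attempts: int = 3,
) -> List[Sequence[bytes]]:
    """Publish messages in batches, re-publishing a whole batch until confirmed.

    publish_batch is a hypothetical hook: it should publish the batch, wait
    for confirms, and return True only if every message was confirmed.
    Returns the batches still unconfirmed after max_attempts tries.
    """
    failed = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        for _attempt in range(max_attempts):
            if publish_batch(batch):
                break  # whole batch confirmed, move on to the next one
        else:
            # No attempt succeeded; hand the batch back to the caller.
            failed.append(batch)
    return failed
```

Because an unconfirmed batch is re-published as a whole, messages from it that did
reach a queue may be delivered twice, so consumers should be able to tolerate
duplicates.
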
--  
MK  

Software Engineer, Pivotal/RabbitMQ

