[rabbitmq-discuss] Pause minority cluster with publisher confirms losing messages

Wed Jun 4 11:04:28 BST 2014

Hi Michael,

Thanks for your fast reply.

To be honest, I don't mind that when a node goes down in a RabbitMQ cluster
it takes a minute or more to decide that the cluster is broken and what to
do. What I don't fully understand is why the fallen node stops confirming
for a while, then suddenly resumes for a few seconds (not always, just
sometimes), and then stops the Rabbit process, closing the connection and
losing messages that had already been confirmed.
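
As a rough sanity check on the "minute or more" figure, here is a minimal
sketch. It assumes the Erlang kernel default net_ticktime of 60 seconds and
the documented rule that an unresponsive node is detected somewhere between
0.75 and 1.25 times that value. The function name is made up for
illustration and is not part of RabbitMQ.

```python
def net_tick_detection_window(net_ticktime_seconds=60.0):
    """Return (earliest, latest) seconds before a peer node is declared down.

    Assumption: detection happens between T - T/4 and T + T/4, where T is
    the kernel net_ticktime setting (default 60 seconds).
    """
    quarter = net_ticktime_seconds / 4.0
    return (net_ticktime_seconds - quarter, net_ticktime_seconds + quarter)


# Default setting, then a lower value for comparison.
print(net_tick_detection_window())
print(net_tick_detection_window(10))
```

With the default this gives a window of roughly 45 to 75 seconds, which
matches the delay I see before node3 decides anything.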

My understanding is that the node should reason something like: I cannot
see nodes 1 and 2 (the connection is broken), I'm by myself here, so I
cannot confirm your publishes. Then it should say: I have to stop, because
I'm in the minority. However, the fact that it confirms messages for a
short period of time suggests that something is not working quite right.
Also, this doesn't always happen; sometimes it does the right thing, so
it's not consistent.

To be honest, I like the trick of having a local RabbitMQ; however, for us
just a cluster would be simpler. Running a local RabbitMQ and maintaining
Federation or Shovel links would be a little overkill.

One more thing from these tests: once, when flushing iptables on node3, it
core dumped with some Erlang trace. Every other time it simply detected the
network and rejoined the cluster without issues. Is this something I should
report? If so, how?

Thanks, cheers
Miguel

2014-06-04 10:17 GMT+02:00 Michael Klishin <mklishin at gopivotal.com>:

>
>
> On 4 June 2014 at 11:58:41, Miguel Araujo Pérez (
> miguel.araujo.perez at gmail.com) wrote:
> > > The issue is that sometimes after a while publisher3 resumes
> > and continues pushing messages and according to the library
> > receiving acks for them, that goes for a period of 6-8 seconds
> > until an exception is raised because connection is closed (node3
> > stops Rabbit). Those "acked messages" aren't however in the
> > queue when I consume it later to see what's inside. However, other
> > times it works as i would expect and doesn't enqueue any other
> > message after iptables takes place.
> >
> > So I thought this could be a library issue, and ported the code
> > to PHP using official php-amqplib and exact same thing happens.
> > My theory is that sometimes node3 after trying to coordinate
> > with other 2 nodes goes into a partition for some seconds, in those
> > seconds it confirms messages and then pause minority cluster
> > policy kicks in and stops Rabbit.
>
> Yes, it takes time for both RabbitMQ and client libraries to detect
> connection failure. This is in part due to how TCP works. You can configure
> the interval of inactivity for RabbitMQ nodes:
>
> https://www.rabbitmq.com/nettick.html
>
> and use a low (say, 1-3 seconds) heartbeat interval for client libraries.
> This should make the exception be thrown much earlier (given that your
> client supports it; Pika should) at the cost of having increased network
> traffic:
>
> http://www.rabbitmq.com/reliability.html
>
> Beyond that, your apps can publish last N messages (excessively) after a
> network failure. If your consumers can de-duplicate them (e.g. every
> message has an id you can set), that should work well.
>
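
The re-publish and de-duplicate idea above can be sketched as follows. This
is only a sketch in plain Python with no broker involved: the names
RecentPublishes, Deduplicator and simulate are hypothetical, and generating
the id with uuid4 is an assumption. With a real client the id would
typically travel in the AMQP message_id property.

```python
import uuid
from collections import OrderedDict, deque


class RecentPublishes:
    """Publisher side: remember the last N messages so they can be re-sent."""

    def __init__(self, max_messages):
        self._recent = deque(maxlen=max_messages)

    def record(self, body):
        # Assumption: a random UUID is good enough as a message id.
        message_id = str(uuid.uuid4())
        self._recent.append((message_id, body))
        return message_id

    def replay(self):
        return list(self._recent)


class Deduplicator:
    """Consumer side: drop messages whose id was already seen recently."""

    def __init__(self, max_ids):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def accept(self, message_id):
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # forget the oldest id
        return True


def simulate():
    """Publish a..d, pretend 'd' was confirmed but lost, then re-send."""
    publisher = RecentPublishes(max_messages=3)
    consumer = Deduplicator(max_ids=100)
    delivered = []
    lost = {"d"}  # stands in for a message acked during the partition
    for body in ["a", "b", "c", "d"]:
        message_id = publisher.record(body)
        if body not in lost and consumer.accept(message_id):
            delivered.append(body)
    # After the connection failure, re-send the last N messages.
    for message_id, body in publisher.replay():
        if consumer.accept(message_id):
            delivered.append(body)
    return delivered


print(simulate())
```

Here "b" and "c" are re-sent but dropped by the consumer as duplicates,
while the lost "d" is delivered on the second attempt. N has to cover at
least the messages published during the window in which confirms cannot be
trusted.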
> If that's not the case, there is a trick that some companies do: they run
> a RabbitMQ node local to machine (which at least greatly reduces the
> probability of RabbitMQ becoming unreachable), publish with publisher
> confirms and a low heartbeat interval to the local node and use
> Federation [1] or Shovel [2] to connect that node to other nodes.
>
> By the way, there are only two official clients: Java and .NET.
>
> 1. http://www.rabbitmq.com/federation.html
> 2. http://www.rabbitmq.com/shovel.html
> --
> MK
>
> Software Engineer, Pivotal/RabbitMQ
>