[rabbitmq-discuss] Pause minority cluster with publisher confirms losing messages

Miguel Araujo Pérez miguel.araujo.perez at gmail.com
Thu Jun 5 11:38:49 BST 2014


Hi,

Thanks. Is there a URL where I can follow the status of the bug?

> Once a node decides to pause, there may be messages "in flight" that were
> already read from the socket and parsed, and being delivered to queues.
> These processes (in both general and Erlang sense) can run in parallel on
> machines with over 1 core.

My understanding is that, to confirm a message, a node in the cluster
must see the other nodes and get confirmation from them. If that is the
case, it makes sense that it doesn't confirm messages when the iptables
rules are applied, and that is what happens, until after some seconds it
resumes and starts confirming messages that are then lost. I'm not sure I
follow how multiple cores make things harder here; I'm probably missing
some concurrency issue.
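
To make the question concrete, here is a toy Python model of how I picture
it. This is not RabbitMQ code: the class, its methods, and the rule that a
confirm waits for every node currently in the mirror set are my assumptions,
made up for illustration.

```python
class ToyMirroredQueue:
    """Toy model only; names and rules are hypothetical, not RabbitMQ's."""

    def __init__(self, mirrors):
        self.mirrors = set(mirrors)  # nodes currently in the mirror set
        self.pending = {}            # msg_id -> set of nodes that acked it
        self.confirmed = []          # msg_ids confirmed to the publisher

    def publish(self, msg_id):
        self.pending[msg_id] = set()

    def ack(self, msg_id, node):
        if msg_id in self.pending:
            self.pending[msg_id].add(node)
            self._flush()

    def drop_mirror(self, node):
        # Assumption: an unreachable node is removed from the mirror set.
        self.mirrors.discard(node)
        self._flush()

    def _flush(self):
        # Confirm every message acked by all nodes still in the mirror set.
        for msg_id in sorted(self.pending):
            if self.mirrors <= self.pending[msg_id]:
                self.confirmed.append(msg_id)
                del self.pending[msg_id]


q = ToyMirroredQueue({"node1", "node2", "node3"})
q.publish(1)
q.ack(1, "node3")        # only the local node has the message
held = list(q.confirmed)
q.drop_mirror("node1")   # peers are declared down...
q.drop_mirror("node2")   # ...and the held confirm is released
```

In this model the confirm is held while node1 and node2 are still in the
mirror set, and goes out as soon as both are dropped, even though only node3
ever had the message.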

I'm attaching the Erlang log from node3. If you look at it, you will see
that the first thing it detects is that nodes rabbitmq-2 and rabbitmq-1 are
not responding. Then it promotes mirrored queues from slave to master. The
cluster minority status is detected last.

I'm not an expert in RabbitMQ internals. I've been reading the parts of the
code that control this flow, and it feels like confirms could be paused
until the node is sure things are OK. I mean, if node3 knows it's connected
to 2 nodes (node2 and node1) and then sees both nodes down, it looks like
something is going wrong.
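
A rough sketch of the check I have in mind, in Python. The function names are
hypothetical and this is not how RabbitMQ does it; I'm assuming a node counts
as being in a minority when it can reach half of the cluster or fewer,
itself included.

```python
def in_minority(reachable_nodes, cluster_size):
    """True if this side of a partition holds half the nodes or fewer.

    reachable_nodes counts the local node itself.
    """
    return 2 * reachable_nodes <= cluster_size


def may_confirm(reachable_nodes, cluster_size):
    """Hypothetical gate: only confirm while in a strict majority."""
    return not in_minority(reachable_nodes, cluster_size)


# node3 in a 3-node cluster, before and after losing node1 and node2
before = may_confirm(reachable_nodes=3, cluster_size=3)
after = may_confirm(reachable_nodes=1, cluster_size=3)
```

With a gate like this, node3 would stop confirming as soon as it sees both
peers down, instead of waiting for the minority status to be detected.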

What strikes me most is that it takes 1 minute and 3 seconds to detect the
minority. Why so long, given that we already know both nodes are down?

I will send another email with the Erlang trace.

Thanks, cheers
Miguel


2014-06-04 12:57 GMT+02:00 Michael Klishin <mklishin at gopivotal.com>:

> On 4 June 2014 at 14:55:57, Miguel Araujo Pérez (
> miguel.araujo.perez at gmail.com) wrote:
> > While doing all these tests, once, when flushing iptables on node3,
> > it core dumped some Erlang trace. All other times it simply
> > detected the network and rejoined the cluster without issues.
> > Is this something I should report? How?
>
> Miguel,
>
> We have filed a bug for the general issue here. Feel free to post the trace
> you see to the list (unless you think it contains sensitive information,
> which it probably doesn't).
> --
> MK
>
> Software Engineer, Pivotal/RabbitMQ
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node3.log
Type: application/octet-stream
Size: 2782 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20140605/358c8082/attachment.obj>

