[rabbitmq-discuss] Active/Active failover and lost messages

Konstantin Kalin konstantin.kalin at gmail.com
Sun Nov 6 22:25:59 GMT 2011


Hello, Matthew.

Thank you for the response and questions. See my comments inline.

Thank you,
Konstantin.

On Nov 4, 8:10 am, Matthew Sackman <matt... at rabbitmq.com> wrote:
> Hi Konstantin,

> On Thu, Nov 03, 2011 at 02:19:51PM -0700, Konstantin Kalin wrote:
> That all sounds fine, but you don't mention whether all 50 publishers
> publish to the same queue and you have 50 consumers consuming from that
> queue, or whether each pair of publisher+consumer have an individual
> queue between them.

Each pair (publisher and consumer) works with a dedicated queue. So I
have 50 queues.

>
> Without understanding your topology better, I'm not quite sure how to
> interpret that.

The topology is simple: publishers and consumers are distributed randomly
across the RabbitMQ nodes, and each of them reconnects to another node if
an exception occurs on its current connection while publishing or
consuming.

> Ok, so that looks like you're using the Java client? Are you also using
> the QueueingConsumer? It's possible that if not, you've been sent some
> messages but they've been overtaken by the exception in some way. If
> you use the QueueingConsumer, that shouldn't happen. However, that said,
> for other reasons, if this was occurring, I'd expect you to be resent
> such messages when you reconsumed from the queue.
>
> I've recently improved our MulticastMain java example so that it copes
> transparently with the ConsumerCancelledException (though I've actually
> not used it to verify the absence of message loss).
>
> http://hg.rabbitmq.com/rabbitmq-java-client/file/15f36113ffd3/test/sr...
> from line 442 onwards may be of use. Oh yes, I discovered when doing
> that QueueingConsumers are not reusable - you really do have to create a
> new one whenever you resubscribe. That bit me for a while...
>

Yes, you are right. I use the Java client and QueueingConsumer.
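
For reference, the consume loop follows the pattern you describe, roughly
like the sketch below (simplified; the queue name and the reconnect step
are placeholders):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.ConsumerCancelledException;
    import com.rabbitmq.client.QueueingConsumer;
    import com.rabbitmq.client.ShutdownSignalException;

    public class ConsumeLoop {
        // A fresh QueueingConsumer is created on every (re)subscribe, and
        // ConsumerCancelledException simply triggers a resubscribe.
        static void run(Channel channel, String queue) throws Exception {
            while (true) {
                QueueingConsumer consumer = new QueueingConsumer(channel); // never reused
                channel.basicConsume(queue, false, consumer);
                try {
                    while (true) {
                        QueueingConsumer.Delivery d = consumer.nextDelivery();
                        System.out.println(new String(d.getBody()));        // application logic goes here
                        channel.basicAck(d.getEnvelope().getDeliveryTag(), false);
                    }
                } catch (ConsumerCancelledException e) {
                    // the queue failed over / was deleted on the node we were attached to:
                    // fall through and resubscribe with a brand-new consumer
                    // (the real code re-declares the queue first)
                } catch (ShutdownSignalException e) {
                    return; // placeholder: the real code reconnects to another node here
                }
            }
        }
    }
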
I also found the root cause of my issue. Due to a typo in the client
code, the queues were created without the "x-ha-policy" argument. It's
stupid but it happened :) So my previous results were obtained with
non-mirrored queues. I was really impressed when I discovered this:
even in that configuration the cluster lost only a few messages.
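
The corrected queue declaration passes "x-ha-policy" explicitly, along
these lines (the queue name is a placeholder; "all" mirrors the queue on
every node of the cluster):

    import java.util.HashMap;
    import java.util.Map;
    import com.rabbitmq.client.Channel;

    public class DeclareMirrored {
        static void declare(Channel channel, String queue) throws java.io.IOException {
            // The "x-ha-policy" argument is what the typo dropped; without it
            // the queue is a plain, non-mirrored queue.
            Map<String, Object> args = new HashMap<String, Object>();
            args.put("x-ha-policy", "all");
            channel.queueDeclare(queue,
                                 true,   // durable
                                 false,  // exclusive
                                 false,  // autoDelete
                                 args);
        }
    }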

Once I corrected the mistake, the cluster works fine. Consumers don't
lose messages when a cluster node fails (stopped manually, Linux
rebooted and so on). Everything is delivered, except in one case:
if a node fails while the cluster is under heavy load (CPU load above
90-95% on the cluster nodes), a few messages are still lost. A publisher
submits a message successfully, but the consumer never receives it. I
have repeated the test several times and it is reproducible, so I'm now
pretty confident that the messages are lost inside RabbitMQ (not in my
code :) ).
I can spend time on troubleshooting and on better understanding what is
happening, but I need some guidance on how to approach it.
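
One way to check on the publisher side that the broker has really taken
responsibility for a message is publisher confirms, roughly as below (a
sketch; it assumes a client version that provides confirmSelect and
waitForConfirmsOrDie, and the routing key is a placeholder):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.MessageProperties;

    public class CheckedPublish {
        static void publish(Channel channel, String queue, byte[] body) throws Exception {
            channel.confirmSelect();   // enable confirms (once per channel in real code)
            channel.basicPublish("", queue, MessageProperties.PERSISTENT_BASIC, body);
            channel.waitForConfirmsOrDie();   // blocks until the broker confirms, throws on nack
        }
    }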

> Best wishes,
>
> Matthew
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-disc... at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

