[rabbitmq-discuss] Active/Active failover and lost messages

Matthew Sackman matthew at rabbitmq.com
Mon Nov 7 09:52:14 GMT 2011


Hi Konstantin,

On Sun, Nov 06, 2011 at 02:25:59PM -0800, Konstantin Kalin wrote:
> Also I found the root cause of my issue. Due to a typo in the client
> code, the queues were created without "x-ha-policy". It's stupid but
> it happened :) So my previous results were obtained with non-mirrored
> queues. I was really impressed when I found this: the cluster lost
> very few messages even under those conditions.

Ah ha! Glad you found it.

> Once I corrected the mistake the cluster works fine. Consumers don't
> lose messages if a cluster node fails (stopped manually, Linux
> rebooted and so on). Everything is delivered except in one case.
> If a node fails under heavy load on the cluster (CPU load is above
> 90-95% on cluster nodes), a few messages are lost anyway. A publisher
> submitted a message properly but a consumer never received it. I
> repeated the test several times and it's reproducible. And now I'm
> pretty confident that messages are lost in RabbitMQ (not in my
> code :) )

Interesting. I would guess that that's the case when the publisher is
connected to a highly loaded node, and the publisher's node is taken
down. Could you check if that's the case?

Are you using publisher confirms? If so, it really shouldn't be the case
that the confirm is received by the client and the message being published
is lost. That should never happen. If you're not using publisher
confirms then yes, it is possible that the publisher thinks it's sent a
message but due to failures, that message never manages to make it
through to any of the members of the mirrored queue. Publisher confirms
are only issued back to the client when the message has made it to all
the queues it's been routed to, so you really shouldn't get confirms for
messages that get lost in this way.
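
To make the bookkeeping concrete, here is a rough sketch of what a
publisher might keep on its side of a confirm-mode channel. It is not tied
to any client library: the class and method names are made up for
illustration. The only protocol facts it relies on are that confirm
sequence numbers start at 1 on a channel and that an ack can carry a
"multiple" flag covering every sequence number up to and including the
delivery tag.

```python
class UnconfirmedTracker:
    """Hypothetical helper: remembers published messages until confirmed."""

    def __init__(self):
        self._next_seq = 1   # confirm sequence numbers start at 1 per channel
        self._pending = {}   # sequence number -> message body

    def record_publish(self, body):
        """Call once per publish, in publish order; returns the sequence number."""
        seq = self._next_seq
        self._pending[seq] = body
        self._next_seq += 1
        return seq

    def handle_ack(self, delivery_tag, multiple=False):
        """Forget messages the broker has confirmed."""
        if multiple:
            # The ack covers every outstanding message up to delivery_tag.
            for seq in [s for s in self._pending if s <= delivery_tag]:
                del self._pending[seq]
        else:
            self._pending.pop(delivery_tag, None)

    def unconfirmed(self):
        """Bodies still awaiting a confirm, in publish order."""
        return [self._pending[s] for s in sorted(self._pending)]
```

Anything still returned by unconfirmed() when the connection drops is a
message the broker never took responsibility for.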

If a message has not been confirmed back to the publisher, and the
publisher gets disconnected due to node failure, the publisher *must*
republish any messages it has not received confirms for. Yes, this does
introduce more possibility of duplicates at the consumers, but you have
to do deduplication anyway for other reasons.
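
As a sketch of how the two halves fit together: the publisher replays
whatever was not confirmed, and the consumer drops anything it has already
seen. This assumes the publisher attaches a stable, unique ID to every
message (for example in the message-id property) and reuses it when
republishing. The names Deduplicator and republish are invented for this
example, and the bounded in-memory ID cache is just one possible choice.

```python
from collections import OrderedDict


class Deduplicator:
    """Hypothetical consumer-side filter keyed on a publisher-assigned ID."""

    def __init__(self, max_ids=10000):
        self._seen = OrderedDict()   # message ID -> True, oldest first
        self._max_ids = max_ids      # assumption: a bounded memory of IDs suffices

    def first_time(self, message_id):
        """Return True if this ID has not been seen (and remember it)."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)   # evict the oldest remembered ID
        return True


def republish(pending, publish):
    """Replay unconfirmed messages after reconnecting.

    pending: iterable of (message_id, body) pairs with no confirm received.
    publish: callable(message_id, body) that sends on the new connection.
    """
    for message_id, body in pending:
        publish(message_id, body)
```

A consumer would call first_time() with the ID of each delivery and only
process the message when it returns True. A message that did reach the
queue before the failure, but whose confirm never reached the publisher,
is then delivered twice and processed once.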

Best wishes,

Matthew

