[rabbitmq-discuss] Lost messages in cluster

Wed Jan 18 18:53:34 GMT 2012

Hi, st0rm...

I'm a bit confused by parts of your post...  Could we walk through
your configuration a bit so that I might get a handle on what you're
doing and what you're having trouble with?

> I have a cluster with 4 disk nodes. Topology is similar to
> many-to-many:

When you say "topology" in this context, what do you mean exactly?

> The only exchange distributed over all nodes.

Exchanges in a clustered RabbitMQ environment are recorded in the
Mnesia database that's distributed amongst the nodes that participate
in the cluster, so in some sense any exchange in a cluster is always
"distributed over all nodes."  Note that the exchange is basically
static data in Mnesia; all of the active work done in routing messages
through exchanges actually happens on a channel process associated
with a connection, as there's no exchange process per se.
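
To make that concrete, here is a toy Python model (not RabbitMQ code;
the names ClusterMetadata and Channel and the direct-exchange matching
rule are assumptions made up for this sketch).  The bindings are
passive shared data, and the routing decision is made by whichever
channel the publisher happens to be using:

```python
class ClusterMetadata:
    """Stand-in for the shared metadata store (Mnesia in a real cluster)."""

    def __init__(self):
        # exchange name -> list of (routing_key, queue, home_node)
        self.bindings = {}

    def bind(self, exchange, routing_key, queue, home_node):
        self.bindings.setdefault(exchange, []).append(
            (routing_key, queue, home_node)
        )

    def route(self, exchange, routing_key):
        # Direct-exchange style matching: exact routing key equality.
        return [
            (queue, node)
            for (key, queue, node) in self.bindings.get(exchange, [])
            if key == routing_key
        ]


class Channel:
    """The active party: routing runs here, not in an 'exchange process'."""

    def __init__(self, connected_node, metadata):
        self.connected_node = connected_node
        self.metadata = metadata

    def publish(self, exchange, routing_key):
        # Every node sees the same metadata, so the result does not
        # depend on which node this channel is connected to.
        return self.metadata.route(exchange, routing_key)


metadata = ClusterMetadata()
metadata.bind("events", "update", "q1", "rabbit@node1")
metadata.bind("events", "update", "q2", "rabbit@node2")

via_node3 = Channel("rabbit@node3", metadata).publish("events", "update")
via_node4 = Channel("rabbit@node4", metadata).publish("events", "update")
```

Both channels resolve the same target queues, even though neither is
connected to a node that hosts one of them.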

> Two
> queues(one of them is durable) declared per node and bound to above
> mentioned exchange. So each node is able to publish messages and they
> will be delivered to other nodes of cluster.

Are you using the new HA/mirrored-queues feature, and have you
explicitly declared replication of queues across cluster nodes? Or are
you using "regular" clustering of the older type? If the latter, then
each queue process and its contents really live on only a single node
in the cluster, although of course clients connected to any node in
the cluster can publish to, or consume from, that queue.
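
For reference, a minimal sketch of what "declaring replication
explicitly" might look like from a client.  It assumes the
`x-ha-policy` / `x-ha-policy-params` queue arguments used by the
mirrored-queues feature in current releases; please check the HA
documentation for your version.  The helper name ha_queue_arguments is
hypothetical, and the resulting dict would be passed as the arguments
table of queue.declare by whatever client library you use:

```python
def ha_queue_arguments(mirror_nodes=None):
    """Build the optional-arguments table for a mirrored queue.

    mirror_nodes=None  -> mirror on every node in the cluster
    mirror_nodes=[...] -> mirror only on the named nodes

    Assumption: the broker understands the x-ha-policy arguments.
    """
    if mirror_nodes is None:
        return {"x-ha-policy": "all"}
    nodes = list(mirror_nodes)
    if not nodes:
        raise ValueError("mirror_nodes must name at least one node")
    return {"x-ha-policy": "nodes", "x-ha-policy-params": nodes}


mirror_everywhere = ha_queue_arguments()
mirror_on_two = ha_queue_arguments(["rabbit@node1", "rabbit@node2"])
```

A queue declared without such arguments is a "regular" queue and lives
on one node only.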

> Recently I've discovered some problems when I have network failures.
> If node lose network connection (or network device is down) for
> approx. 1 minute ( or net_ticktime period of time), this node doesn't
> receive message that were published to exchange while network was
> down.

By "node" do you mean one of your producers or consumers, or one of
the Rabbit cluster members? If you mean a Rabbit cluster member then
you can't publish to, or consume from, a queue as long as the node on
which it lives is down. If you use the new active-active,
mirrored-queues HA system then you can specify that a queue be
replicated across multiple nodes in a cluster, and the loss of the
master queue replica leads to one of the replicated slaves taking
over.  There's also an older active-passive HA system where you use
something like Pacemaker to switch over from a failed cluster node to
a hot standby that shares storage with the node it backs up.
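
As an illustration of the mirrored-queue case only, here is a toy
Python model (again not RabbitMQ code).  The MirroredQueue class and
its methods are hypothetical, and the rule that the longest-standing
slave is the one promoted is an assumption of this sketch:

```python
class MirroredQueue:
    """Toy model: one master replica plus slaves on other nodes."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("need at least one node")
        # Replicas in joining order; the first entry is the master.
        # (dicts preserve insertion order in Python 3.7+)
        self.replicas = {node: [] for node in nodes}

    @property
    def master(self):
        return next(iter(self.replicas))

    def publish(self, message):
        # Every live replica receives a copy of the message.
        for contents in self.replicas.values():
            contents.append(message)

    def node_down(self, node):
        # Dropping the master implicitly promotes the next-oldest replica.
        self.replicas.pop(node, None)
        if not self.replicas:
            raise RuntimeError("all replicas lost; queue contents are gone")

    def contents(self):
        return list(self.replicas[self.master])


queue = MirroredQueue(["rabbit@a", "rabbit@b", "rabbit@c"])
queue.publish("m1")
queue.node_down("rabbit@a")   # master fails
queue.publish("m2")           # queue is still usable
```

After the master's node goes down, the queue and the message published
before the failure are still there on the promoted replica.  With a
non-mirrored queue there is only one entry in that table, so losing
its node makes the queue unavailable.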

> I've read that "RabbitMQ clustering does not tolerate network partions
> well".

When people say that, they usually mean that the Mnesia database that
holds the metadata defining queues, bindings, exchanges and the like
doesn't handle partitions well, as a Rabbit cluster relies on a
consistent view of this information to do its work sensibly.  Hence,
we don't recommend clustering Rabbit over wide areas or over
high-latency or low-reliability links, because those don't play well
with the assumptions that Mnesia's design makes.

> I suppose that federation plugin can help me, but I can't
> understand what network topology have to be built to preserve current
> functionality - what upstreams should exist and etc.

Again I'm a bit confused when you say "network topology."  One often
uses the Federation or Shovel plugins to bridge between brokers (or
clusters of brokers) over a potentially high-latency, low-reliability
WAN connection, typically between geographically separate
locations.  Is this the scenario you have, i.e. brokers or clusters
thereof in geographically separate datacenters?  If so, the first
task is to identify which parts of your message traffic you need
moved from one site to the other and think through the setup of your
federations or shovels from that starting point...

> Could somebody help me to avoid losing messages?

Answering this fully depends on what potential causes of message loss
you want to immunize yourself against.  If you're worried about the
failure of a cluster node on which a queue resides rendering that
queue unavailable, your options are either:

  - the new active/active mirrored queues HA

  - the old active/passive system with shared storage and something
    like Pacemaker handling the failover

If you need messages to be moved over potentially high latency or
flaky WAN links, then you want to consider Shovel or Federation for
bridging the wild network waters between the islands on which your
clusters live.

Both are documented on the RabbitMQ website, and the latter is
discussed in the Manning book "RabbitMQ in Action" (currently
available as a preview eBook, final print version due out later this
Spring).
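
To give a feel for what bridging over a flaky link involves, here is a
toy store-and-forward loop in Python.  It is not the Shovel or
Federation plugin, just a sketch of the general idea under stated
assumptions: FlakyLink and shovel are hypothetical names, the link
fails on a scripted set of attempts, and a message is removed from the
source only after the destination has accepted it:

```python
from collections import deque


class FlakyLink:
    """Hypothetical WAN link that fails on a scripted set of attempts."""

    def __init__(self, failing_attempts):
        self.failing_attempts = set(failing_attempts)
        self.attempt = 0

    def send(self, destination, message):
        self.attempt += 1
        if self.attempt in self.failing_attempts:
            raise ConnectionError("link down")
        destination.append(message)


def shovel(source, destination, link, max_attempts=10):
    """Move messages from source to destination, in order.

    Returns True if the source was fully drained within max_attempts.
    """
    attempts = 0
    while source and attempts < max_attempts:
        attempts += 1
        message = source[0]        # peek; leave it queued for now
        try:
            link.send(destination, message)
        except ConnectionError:
            continue               # link failed: retry the same message
        source.popleft()           # "ack": safe to drop from the source
    return not source


site_a = deque(["m1", "m2", "m3"])
site_b = []
link = FlakyLink(failing_attempts={2, 3})
drained = shovel(site_a, site_b, link)
```

Because the message stays queued at the source until the send has
succeeded, a link failure delays delivery rather than losing the
message.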

Does this help at all?

Best regards,
Jerry