[rabbitmq-discuss] Looking for some clarification on mirrored queue implementation

Tue Feb 7 17:41:14 GMT 2012

Hi Matt,

On Tue, Feb 07, 2012 at 05:18:54PM +0000, Emile Joubert wrote:
> Neither. RabbitMQ uses a separate Erlang process per channel and this
> channel process is responsible for sending publish messages to each
> slave as well as the master.

Indeed. It's actually no different from a message being published to an
exchange which then routes the message to several different queues - the
channel process on the broker is responsible for finding out which
queues the message is destined for and forwarding the message to all
those queues. In the case of a mirrored queue, the expansion step to go
from "queue name" to "queue process ID" returns several process IDs.
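
As a rough illustration of that expansion step (a Python sketch, not the
broker's Erlang code; the QUEUE_PIDS table, the expand function and the
pid strings are all made up for the example):

```python
from typing import Dict, List

# Hypothetical registry: queue name -> process IDs. Strings stand in for
# Erlang pids. A mirrored queue maps to its master plus every slave.
QUEUE_PIDS: Dict[str, List[str]] = {
    "plain": ["pid_plain"],
    "mirrored": ["pid_master", "pid_slave_1", "pid_slave_2"],
}

def expand(queue_names: List[str]) -> List[str]:
    """Expand routed queue names into every process ID to deliver to."""
    pids: List[str] = []
    for name in queue_names:
        pids.extend(QUEUE_PIDS[name])
    return pids

# One publish routed to two queues, one of which is mirrored.
targets = expand(["plain", "mirrored"])
```

The channel then sends the message to every pid in targets; it does not
need to treat the mirrored case specially.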

> The slaves and master make use of a
> separate fault-tolerant framework (Guaranteed Multicast) to communicate.
> The master uses GM to communicate all messages (including publish
> messages) to slaves. Slaves therefore receive publish messages from the
> channel as well as GM, and all other messages from GM only.

...and the purpose of this is as follows.

Because you can have multiple channels publishing messages to the same
mirrored queue at the same time, there is the possibility that different
members of the mirrored queue see the publishes in different orders when
they receive the publishes directly from the channel processes. This
will not do - the messages *must* be in the same order in all members of
the mirrored queue. This is why publishes *also* go via GM - the master
pushes each publish onto GM and the slaves receive that and use it to
derive the correct order.
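
A toy illustration of the ordering problem, in Python rather than Erlang.
The message ids and the slave_queue helper are invented for the example,
and it assumes the GM stream simply replays the master's own order:

```python
# Two channels (a and b) publish concurrently, so the direct copies can
# reach each member of the mirrored queue in a different order.
master_arrivals = ["a1", "b1", "a2"]
slave_arrivals = ["b1", "a1", "a2"]

# Assumption for this sketch: the master pushes publishes onto GM in the
# order in which it processed them.
gm_stream = list(master_arrivals)

def slave_queue(direct, gm):
    """Order the slave's messages by the GM stream, not by arrival order."""
    seen = set(direct)
    return [msg_id for msg_id in gm if msg_id in seen]

result = slave_queue(slave_arrivals, gm_stream)
```

Here result matches master_arrivals even though slave_arrivals does not.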

But if publishes went *only* to the master, which then pushed them via
GM to all the slaves, there would be a problem when the master dies.
There is a window of time before any of the slaves notice the death of
the master, and during that window there could be in-flight publishes
going from the channels to the old master. Those publishes would be
lost - the master is dead so will not be able to process them and push
them onto GM.

So as a result, publishes go via both routes - directly to all the
members of the mirrored queue to ensure that no publishes ever get lost,
and secondly via GM, pushed by the master, so that the slaves can
actually enqueue the messages in the right order.
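
To make the two routes concrete, here is a much simplified Python sketch
of a slave. It is not how rabbit_mirror_queue_slave is written; the Slave
class, its method names and the promote step are assumptions made purely
for illustration:

```python
class Slave:
    """Toy slave: GM fixes the order, direct copies cover a dead master."""

    def __init__(self):
        self.pending = {}       # msg_id -> body, from a channel, not yet on GM
        self.early_gm = set()   # ids seen on GM before the direct copy arrived
        self.queue = []         # bodies, in the order the master dictated

    def on_channel_publish(self, msg_id, body):
        if msg_id in self.early_gm:
            # GM already delivered and ordered this one; nothing to keep.
            self.early_gm.discard(msg_id)
        else:
            self.pending[msg_id] = body

    def on_gm_publish(self, msg_id, body):
        # GM order is the queue order.
        self.queue.append(body)
        if msg_id in self.pending:
            del self.pending[msg_id]
        else:
            self.early_gm.add(msg_id)

    def promote(self):
        # The master died. Anything that arrived directly but was never
        # pushed onto GM is enqueued now instead of being lost.
        for msg_id in list(self.pending):
            self.queue.append(self.pending.pop(msg_id))


s = Slave()
s.on_channel_publish("b1", "B1")   # direct copies arrive out of order
s.on_channel_publish("a1", "A1")
s.on_gm_publish("a1", "A1")        # the master's order: a1 then b1
s.on_gm_publish("b1", "B1")
s.on_channel_publish("a2", "A2")   # master dies before pushing a2 onto GM
s.promote()
```

After promote, s.queue holds A1, B1, A2: the first two in the master's
order, and the third rescued from the direct route.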

> The sources contain some further detail which may be of interest:
> http://hg.rabbitmq.com/rabbitmq-server/file/default/src/rabbit_mirror_queue_coordinator.erl
> http://hg.rabbitmq.com/rabbitmq-server/file/default/src/gm.erl

Indeed. In general, it's much more complex than you can imagine. For
example, a publishing client that's using publisher confirms could be
publishing to a mirrored queue. Each confirm will only be issued once
all members of the mirrored queue have received *and* correctly
enqueued the message. And even if the master and any-but-not-all slaves
fail, the code will ensure that not only are no messages lost, but all
the confirms will still be correctly issued, assuming both the node to
which the client is connected survives and at least one member of the
mirrored queue survives.
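
A minimal sketch of the confirm rule described above, again in Python
and again hypothetical (ConfirmTracker and its methods are invented
names, and the real bookkeeping is spread across several processes):

```python
class ConfirmTracker:
    """Toy tracker: confirm once every surviving member has enqueued."""

    def __init__(self, members):
        self.members = set(members)
        self.waiting = {}     # msg_id -> members that have not yet enqueued
        self.confirmed = []   # msg_ids in the order confirms were issued

    def published(self, msg_id):
        self.waiting[msg_id] = set(self.members)
        self._check(msg_id)

    def enqueued(self, msg_id, member):
        if msg_id in self.waiting:
            self.waiting[msg_id].discard(member)
            self._check(msg_id)

    def member_died(self, member):
        # A dead member can no longer hold up a confirm.
        self.members.discard(member)
        for msg_id in list(self.waiting):
            self.waiting[msg_id].discard(member)
            self._check(msg_id)

    def _check(self, msg_id):
        # Never confirm if no member of the mirrored queue survives.
        if self.members and not self.waiting[msg_id]:
            del self.waiting[msg_id]
            self.confirmed.append(msg_id)


t = ConfirmTracker(["master", "slave_1", "slave_2"])
t.published("m1")
t.enqueued("m1", "slave_1")
t.member_died("master")            # master fails before enqueueing m1
after_death = list(t.confirmed)    # still waiting on slave_2
t.enqueued("m1", "slave_2")
```

The confirm for m1 is only issued at the last step, once the remaining
slave has enqueued it.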

Most if not all of the complexity arises from a) we do almost everything
regarding publishing asynchronously to take advantage of parallelism and
ensure performance, but this means some nodes could fall a long way
behind others; and b) failures and births can occur at any time and we
try pretty hard to cope transparently with almost everything. And some
of it we even get right...

Matthew