If your RabbitMQ broker consists of a single node, then a failure of that node will cause downtime, temporary unavailability of service, and potentially loss of messages (especially non-persistent messages held by non-durable queues). You could publish all messages as persistent, to durable queues, but even then, due to buffering, there is a period between the message being sent and the message being written to disk and fsync'd. Using publisher confirms is one means to ensure the client knows which messages have been written to disk, but even so, you may not wish to suffer the downtime and inconvenience of the unavailability of service caused by a node failure, or the performance degradation of having to write every message to disk.
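As a minimal sketch of that approach using the Java client, the snippet below declares a durable queue, publishes a persistent message, and waits for a publisher confirm. The host, queue name and class name are illustrative, and waitForConfirmsOrDie is assumed to be available in the client version you are using:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class DurablePublish {
    public static void main(String[] argv) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // Declare a durable queue so that the queue itself survives a broker restart.
        channel.queueDeclare("myqueue", true, false, false, null);

        // Put the channel into confirm mode so the broker reports when it has
        // taken responsibility for each message (for persistent messages sent
        // to a durable queue, that means the message has been written to disk).
        channel.confirmSelect();

        // Publish the message as persistent (delivery mode 2).
        channel.basicPublish("", "myqueue",
                MessageProperties.PERSISTENT_BASIC,
                "hello".getBytes());

        // Block until the broker confirms the message, throwing if it is nacked.
        channel.waitForConfirmsOrDie();

        channel.close();
        conn.close();
    }
}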
You could use a cluster of RabbitMQ nodes to construct your RabbitMQ broker. This will be resilient to the loss of individual nodes in terms of the overall availability of service, but some important caveats apply: whilst exchanges and bindings survive the loss of individual nodes, queues and their messages do not. This is because a queue and its contents reside on exactly one node, thus the loss of a node will render its queues unavailable.
You could use an active/passive pair of nodes such that should one node fail, the passive node will be able to come up and take over from the failed node. This can even be combined with clustering. Whilst this approach ensures that failures are quickly detected and recovered from, there can be reasons why the passive node can take a long time to start up, or potentially even fail to start. This can cause, at best, temporary unavailability of the queues which were located on the failed node.
To solve these various problems, we have developed active/active high availability for queues. This works by allowing queues to be mirrored on other nodes within a RabbitMQ cluster. The result is that should one node of a cluster fail, the queue can automatically switch to one of the mirrors and continue to operate, with no unavailability of service. This solution still requires a RabbitMQ cluster, which means that it will not cope seamlessly with network partitions within the cluster and, for that reason, is not recommended for use across a WAN (though of course, clients can still connect from as near and as far as needed).
A mirrored queue will behave the same as a non-mirrored queue, with the following exceptions:
In normal operation, for each mirrored-queue, there is one master and several slaves, each on a different node. The slaves apply the operations that occur to the master in exactly the same order as the master and thus maintain the same state. All actions other than publishes go only to the master, and the master then broadcasts the effect of the actions to the slaves. Thus clients consuming from a mirrored queue are in fact consuming from the master.
Should a slave fail, there is little to be done other than some bookkeeping: the master remains the master and no client need take any action or be informed of the failure.
If the master fails, then one of the slaves must be promoted. At this point, the following happens:
As the chosen slave becomes the master, no messages that are published to the mirrored-queue during this time will be lost: messages published to a mirrored-queue are always published directly to the master and all slaves. Thus should the master fail, the messages continue to be sent to the slaves and will be added to the queue once the promotion of a slave to the master completes.
Similarly, messages published by clients using publisher confirms will still be confirmed correctly even if the master (or any slaves) fail between the message being published and the message being able to be confirmed to the publisher. Thus from the point of view of the publisher, publishing to a mirrored-queue is no different from publishing to any other sort of queue. It is only consumers that need to be aware of the possibility of needing to re-consume from a mirrored-queue upon receipt of a Consumer Cancellation Notification.
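As a rough sketch of how a publisher might take advantage of this, the class below enables confirm mode and tracks unconfirmed sequence numbers with a ConfirmListener, so that anything still unconfirmed after a connection or node failure could be republished. The class name and bookkeeping structure are illustrative, not part of the client API:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.ConfirmListener;
import com.rabbitmq.client.MessageProperties;

import java.io.IOException;
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

public class ConfirmingPublisher {
    private final Channel channel;
    // Sequence numbers of messages published but not yet confirmed.
    private final SortedSet<Long> unconfirmed =
            Collections.synchronizedSortedSet(new TreeSet<Long>());

    public ConfirmingPublisher(Channel channel) throws IOException {
        this.channel = channel;
        channel.confirmSelect();
        channel.addConfirmListener(new ConfirmListener() {
            public void handleAck(long seqNo, boolean multiple) {
                // The broker has taken responsibility for these messages.
                if (multiple) {
                    unconfirmed.headSet(seqNo + 1).clear();
                } else {
                    unconfirmed.remove(seqNo);
                }
            }
            public void handleNack(long seqNo, boolean multiple) {
                // These messages were not handled; a real application would
                // republish them rather than just forget them (omitted here).
                if (multiple) {
                    unconfirmed.headSet(seqNo + 1).clear();
                } else {
                    unconfirmed.remove(seqNo);
                }
            }
        });
    }

    public void publish(String queue, byte[] body) throws IOException {
        unconfirmed.add(channel.getNextPublishSeqNo());
        channel.basicPublish("", queue, MessageProperties.PERSISTENT_BASIC, body);
    }
}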
If you are consuming from a mirrored-queue with noAck=true (i.e. the client is not sending message acknowledgements) then messages can be lost. This is no different from the norm of course: the broker considers a message acknowledged as soon as it has been sent to a noAck=true consumer, and should the client disconnect abruptly, the message may never be received. In the case of a mirrored-queue, should the master die, messages that are in-flight on their way to noAck=true consumers may never be received by those clients, and will not be requeued by the new master. Because of the possibility that the consuming client is connected to a node that survives, the Consumer Cancellation Notification is useful in identifying when such events may have occurred. Of course, in practice, if you care about not losing messages then you are advised to consume with noAck=false.
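A sketch of such a consumer is shown below: it consumes with noAck=false, acknowledges each message explicitly, and re-issues basicConsume from handleCancel when a Consumer Cancellation Notification arrives. The queue name myqueue matches the declaration examples later in this section; the class name and the simple re-consume strategy are illustrative:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;

public class MirroredQueueConsumer extends DefaultConsumer {
    public MirroredQueueConsumer(Channel channel) {
        super(channel);
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body)
            throws IOException {
        // ... process the message, then acknowledge it explicitly ...
        getChannel().basicAck(envelope.getDeliveryTag(), false);
    }

    @Override
    public void handleCancel(String consumerTag) throws IOException {
        // The queue we were consuming from has failed over (or been deleted).
        // Resume consuming; be prepared for some messages to be redelivered.
        getChannel().basicConsume("myqueue", false,
                new MirroredQueueConsumer(getChannel()));
    }
}

Consumption is started in the usual way, for example with channel.basicConsume("myqueue", false, new MirroredQueueConsumer(channel)).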
A node may join a cluster at any time. Depending on the configuration of a queue, when a node joins a cluster, queues may add a slave on the new node. At this point, the new slave will be empty: it will not contain any existing contents of the queue, and currently there is no synchronisation protocol. Such a slave will receive new messages published to the queue, and thus over time will accurately represent the tail of the mirrored-queue. As messages are drained from the mirrored-queue, the size of the head of the queue for which the new slave is missing messages will shrink until eventually the slave's contents precisely match the master's contents. At this point, the slave can be considered fully synchronised, but it is important to note that this has occurred because of the actions of clients in terms of draining the pre-existing head of the queue.
Thus a newly added slave provides no additional form of redundancy or availability of the queue's contents until the contents of the queue that existed before the slave was added have been removed. As a result of this, it is preferable to bring up all nodes on which slaves will exist prior to creating mirrored queues, or even better to ensure that your use of messaging generally results in very short or empty queues that rapidly drain.
If you stop a RabbitMQ node which contains the master of a mirrored-queue, some slave on some other node will be promoted to the master (assuming there is one). If you continue to stop nodes then you will reach a point where a mirrored-queue has no more slaves: it exists only on one node, which is now its master. If the mirrored-queue was declared durable then, if its last remaining node is shut down, durable messages in the queue will survive the restart of that node. In general, as you restart other nodes, if they were previously part of a mirrored-queue then they will rejoin the mirrored-queue.
However, there is currently no way for a slave to know whether or not its queue contents have diverged from the master it is rejoining (this could happen during a network partition, for example). As such, when a slave rejoins a mirrored-queue, it throws away any durable local contents it already has and starts empty. Its behaviour at this point is the same as if it were a new node joining the cluster.
A mirrored-queue must be created as a mirrored-queue. It is not possible to convert a non-mirrored-queue to a mirrored-queue at some later point. It is perfectly acceptable to create a mirrored-queue with no slaves initially, though be aware of the behaviour of adding nodes to a cluster.
To create a mirrored-queue, you provide an x-ha-policy entry in the argument table presented to queue.declare. The value of this entry is a long string which gives the name of the policy you wish to use for this queue. There are currently two policies available:

all: This policy ensures that a mirror of the queue will exist on every node in the cluster. When a new node joins the cluster, a mirror of the queue will be created on the new node. The queue's initial master will be on the node to which the client issuing the queue.declare is connected.

nodes: This policy allows you to specify the exact nodes on which you wish mirrors to exist. To do this, you provide an additional entry in the arguments table, with a key of x-ha-policy-params, and a value which is an array of long strings, each of which is a node name. The queue's initial master will be the first node in the list (and thus this node must exist and be reachable, even if it is not the node to which the client is connected), and slaves are created on the remaining nodes in the list. With this policy, you may declare a mirrored-queue with mirrors on a subset of the nodes in a cluster. You may also declare a mirrored-queue with slaves on nodes which are not currently members of the cluster: when such nodes join the cluster, a slave of the queue will automatically be created on that node. Note however that the mirror on the newly joined node will behave as a new slave. Also note that it is node names that must be provided, not host names or IP addresses. It is an error to supply an empty array as the value of the x-ha-policy-params entry: the array must contain at least one element.

Some examples:
Map<String, Object> args = new HashMap<String, Object>(); args.put("x-ha-policy", "all"); channel.queueDeclare("myqueue", false, false, false, args);This declares a queue
myqueue
which has a mirror on every node that is in the cluster or joins the cluster.
Map<String, Object> args = new HashMap<String, Object>(); args.put("x-ha-policy", "nodes"); args.put("x-ha-policy-params", Arrays.asList("node1@rabbit", "node2@rabbit", "node4@rabbit")); channel.queueDeclare("myqueue", false, false, false, args);This declares a queue
myqueue
which has a
mirror on each of the nodes node1@rabbit
,
node2@rabbit
and node4@rabbit
.