[rabbitmq-discuss] Broker failover

Matthew Sackman matthew at lshift.net
Wed Aug 19 15:41:59 BST 2009


Hi Niko,

On Wed, Aug 19, 2009 at 03:06:50PM +0100, Niko Felger wrote:
> Are there any best practices for achieving broker failover?
> 
> We are currently using two clustered nodes with durable queues and
> exchanges. The clients are configured to connect to the first node. In the
> event that this node dies, I would like both existing consumers as well as
> newly started ones to connect to the other node. Are there standard patterns
> or recipes to achieve this?

There's nothing standard just yet, but we're getting a lot of interest
in this area and are working on solutions. At the moment, the situation
is as follows:

Due to the way mnesia works, you can't just transfer the database files
from one machine to another and start the broker up: for that to work,
both machines must have the same hostname, because mnesia records the
hostname in the database. You can work around this by giving every
machine the node name rabbit@localhost. However, this prevents you from
using clustering, which is a shame.

Therefore, if HA and failover are important to you, we'd recommend the
following:

1) Put a simple TCP/IP load balancer in front of the rabbit nodes, but
do this only for producers. The load balancer needs to be able to cope
dynamically with nodes going down, reappearing and so on.
2) Consumers, on the other hand, should try to consume from all the
nodes at the same time, and need to cope silently with nodes going down
and reappearing. The exact details vary between applications; there's a
rough sketch of such a consumer loop after this list.
3) Have a SAN with some shared storage which is not partitioned. All the
rabbit nodes need access to this.
4) Use Linux-HA or equivalent to monitor your rabbit nodes, and start
up all the brokers with the node name rabbit@localhost.
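
For (2), here's a rough sketch of a consumer loop using the Python pika
client (the 1.x API). The hostnames, queue name and retry interval are
placeholders rather than anything RabbitMQ prescribes, and you'd adapt
the error handling to your own application:

    import threading
    import time

    import pika

    NODES = ["rabbit-a.example.com", "rabbit-b.example.com"]  # placeholder hosts
    QUEUE = "work"                                            # placeholder queue

    def handle(channel, method, properties, body):
        # ... process the message, then acknowledge it ...
        channel.basic_ack(delivery_tag=method.delivery_tag)

    def consume_forever(host):
        # One consumer per broker node; quietly reconnect if the node goes away.
        while True:
            try:
                params = pika.ConnectionParameters(host=host)
                conn = pika.BlockingConnection(params)
                channel = conn.channel()
                channel.queue_declare(queue=QUEUE, durable=True)
                channel.basic_consume(queue=QUEUE, on_message_callback=handle)
                channel.start_consuming()
            except pika.exceptions.AMQPError:
                time.sleep(5)  # node down or unreachable; retry until it returns

    threads = [threading.Thread(target=consume_forever, args=(h,)) for h in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()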

Now, when a node fails, Linux-HA will notice and should tell a spare
node to start up, setting RABBITMQ_MNESIA_DIR to the location on the
SAN of the failed node's files. The broker should then just start up;
the start action would look something like the sketch below.
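
The start action of such a resource script could be as simple as the
following sketch (written in Python purely for illustration; a real
Linux-HA agent would typically be a shell script). The SAN path is a
placeholder, and RABBITMQ_NODENAME just sets the rabbit@localhost node
name described above:

    import os
    import subprocess

    # Placeholder: where the failed node's mnesia files live on the shared SAN.
    FAILED_NODE_MNESIA_DIR = "/san/rabbitmq/node1-mnesia"

    env = dict(os.environ)
    env["RABBITMQ_NODENAME"] = "rabbit@localhost"        # same name on every machine
    env["RABBITMQ_MNESIA_DIR"] = FAILED_NODE_MNESIA_DIR  # take over the failed node's data

    # Start the broker detached; Linux-HA's monitor action then keeps an eye on it.
    subprocess.Popen(["rabbitmq-server", "-detached"], env=env)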

Obviously, this depends on the reliability and availability of your
SAN, and the lack of clustering complicates things, at least for the
consumers. However, if HA and failover are more important to you, this
may be a tradeoff you're willing to make for now.

Also, be aware that with this solution, non-persistent messages can be
lost when a node goes down, and even persistent messages can be lost if
they were not published as part of a transaction.
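
In other words, to give a message a reasonable chance of surviving a
failure, mark it persistent (delivery_mode=2) and publish it inside an
AMQP transaction, so that the broker has dealt with it before the
commit returns. A rough sketch, again using pika with placeholder
names:

    import pika

    params = pika.ConnectionParameters(host="rabbit-a.example.com")  # placeholder
    conn = pika.BlockingConnection(params)
    channel = conn.channel()
    channel.queue_declare(queue="work", durable=True)  # placeholder, durable queue

    channel.tx_select()                                # start an AMQP transaction
    channel.basic_publish(
        exchange="",
        routing_key="work",
        body=b"hello",
        properties=pika.BasicProperties(delivery_mode=2))  # mark message persistent
    channel.tx_commit()       # the broker has accepted the message once this returns
    conn.close()

Transactions do slow publishing down, though, so you'd probably only
want to pay that cost for messages you really can't afford to lose.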

Needless to say, a more comprehensive solution is on our TODO list, but
may be a little way off just at the moment.

I hope this helps,

Matthew



