[rabbitmq-discuss] General questions about HA, Stability/Reliability and Broker Administration

Tue Jul 27 15:33:42 BST 2010

On Sun, Jul 25, 2010 at 1:33 AM, Dave Greggory <davegreggory at yahoo.com> wrote:

> 2. HA/Failover: I've seen the Pacemaker guide but I'm a little hesitant to
> set that up as we have little experience in house with
> Pacemaker/Corosync/DRBD. How many people use it for HA/Failover in
> production systems and how happy are you with it? Does it support failing
> over if the hard drive on one of the nodes dies, instead of something a
> little simpler like a node running out of memory or hanging?

We are contemplating this and have done some trialling and testing. For us
the question is whether to provide HA at the XenServer level or at the
host/app level using Pacemaker etc.

It took a little bit of fiddling to get it running with Pacemaker (this was
before the HA document was available), but once we had the system working,
it worked well and still does. Our solution uses shared iSCSI storage rather
than DRBD and so relies on the reliability of the SAN. If a drive on one of
the hosts fails (such as the one holding the root or another partition) and
this causes difficulties for the status check script, it will fail over to
the other node. We assume that the drives containing the RabbitMQ storage
are "safe" through redundancy (RAID1, redundant storage controllers etc).
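To make the failover trigger above concrete, here is a minimal sketch of
what such a status check could look like. It is not our actual script: the
function name check_broker, the 30 second timeout and the choice of
"rabbitmqctl status" as the probe are assumptions for illustration. The exit
codes are the standard OCF ones that Pacemaker resource agents return.

```python
import subprocess

# Standard OCF resource agent exit codes understood by Pacemaker.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def check_broker(command=("rabbitmqctl", "status"), timeout=30,
                 run=subprocess.run):
    """Probe the broker and map the outcome to an OCF exit code.

    `command`, `timeout` and `run` are hypothetical parameters chosen for
    this sketch; `run` is injectable so the logic can be tested offline.
    """
    try:
        result = run(list(command), capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The probe hung, e.g. the node is wedged or out of memory.
        return OCF_ERR_GENERIC
    except OSError:
        # The probe could not be started at all, e.g. the disk holding
        # the binaries has failed or the command is missing.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if result.returncode == 0 else OCF_NOT_RUNNING
```

A monitor action would exit with the returned value, for example
sys.exit(check_broker()). The point of the sketch is that a failed local
drive shows up as the probe itself failing, which the cluster manager treats
the same as a failed broker.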

At this stage we are leaning towards the XenServer level because it is less
complex and still satisfies our requirements. We've also had some hardware
changes (the production system will now be on an FC SAN rather than iSCSI)
and have not yet tested the new configuration.

In terms of monitoring, we generally run rabbitmqctl list_queues as part of
a Munin plugin. We plan to hook it up to Nagios, but haven't done so yet.
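As a rough illustration of the Munin approach, the sketch below parses the
output of "rabbitmqctl list_queues name messages" and renders it in the
plugin format Munin expects ("field.label" lines for the config call,
"field.value" lines otherwise). It is not our actual plugin: the function
names and the graph title, label and category are assumptions, and the
parser simply skips any line that is not a tab-separated name and count.

```python
import re
import subprocess
import sys


def parse_list_queues(output):
    """Turn `rabbitmqctl list_queues name messages` output into a dict."""
    depths = {}
    for line in output.splitlines():
        parts = line.split("\t")
        # Skip banner lines such as "Listing queues ..." and "...done."
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        depths[parts[0]] = int(parts[1])
    return depths


def munin_field(queue_name):
    """Map a queue name to a Munin-safe field name."""
    field = re.sub(r"[^A-Za-z0-9_]", "_", queue_name)
    if not re.match(r"[A-Za-z_]", field):
        field = "_" + field
    return field


def render(depths, config=False):
    """Render the Munin 'config' output or the current values."""
    lines = []
    if config:
        # Graph metadata below is an arbitrary choice for this sketch.
        lines += ["graph_title RabbitMQ queue depth",
                  "graph_vlabel messages",
                  "graph_category rabbitmq"]
    for name in sorted(depths):
        field = munin_field(name)
        if config:
            lines.append(f"{field}.label {name}")
        else:
            lines.append(f"{field}.value {depths[name]}")
    return "\n".join(lines)


def main(argv):
    """Entry point: Munin calls the plugin with 'config' or no argument."""
    out = subprocess.run(["rabbitmqctl", "list_queues", "name", "messages"],
                         capture_output=True, text=True, check=True).stdout
    print(render(parse_list_queues(out),
                 config=len(argv) > 1 and argv[1] == "config"))
```

A real plugin file would end by calling main(sys.argv) and would need to run
as a user that is allowed to invoke rabbitmqctl. The same parsed queue depths
could feed a Nagios check by comparing them against warning and critical
thresholds.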

Hope this is useful information.

Joe