On Sun, Jul 25, 2010 at 1:33 AM, Dave Greggory <davegreggory@yahoo.com> wrote:
> 2. HA/Failover: I've seen the Pacemaker guide but I'm a little hesitant to set
> that up, as we have little experience in house with Pacemaker/Corosync/DRBD. How
> many people use it for HA/failover in production systems, and how happy are you
> with it? Does it support failing over if the hard drive on one of the nodes dies,
> rather than something a little simpler like a node running out of memory or
> hanging?

We are contemplating this and have done some trialling/testing. For us the question is whether to provide HA at the XenServer level or at the host/app level using Pacemaker etc.
It took a little bit of fiddling to get it running with Pacemaker (this was before the HA document was available), but once we had the system working, it has worked well. Our solution uses shared iSCSI storage rather than DRBD and so relies on the reliability of the SAN. If a drive on one of the hosts fails (such as the root or another partition) and this causes difficulties for the status check script, it will fail over to the other node. We assume that the drives containing the RabbitMQ storage are "safe" through redundancy (RAID 1, redundant storage controllers, etc.).
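
To give a feel for it, the status check boils down to something like the following. This is a rough sketch rather than our actual script; it only assumes rabbitmqctl is on the PATH and uses the standard OCF exit codes Pacemaker expects from a monitor action:

#!/usr/bin/env python
# Rough sketch of a RabbitMQ status check of the kind Pacemaker can call as a
# monitor action (not our production script). Exit codes follow the usual OCF
# conventions: 0 = running, 7 = not running.
import subprocess
import sys

OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def rabbitmq_running():
    """Return True if 'rabbitmqctl status' reports a running node."""
    try:
        proc = subprocess.Popen(["rabbitmqctl", "status"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        proc.communicate()
        return proc.returncode == 0
    except OSError:
        # rabbitmqctl itself is unavailable, e.g. the partition holding it
        # has gone away; treat that the same as a dead broker.
        return False


if __name__ == "__main__":
    sys.exit(OCF_SUCCESS if rabbitmq_running() else OCF_NOT_RUNNING)
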
At this stage we are leaning towards the XenServer level, since it is less complex and still satisfies our requirements. We've also had some hardware changes (the production system will now be on an FC SAN rather than iSCSI) and have not yet done the testing work on the new configuration.
In terms of monitoring, we generally run rabbitmqctl list_queues as part of a Munin plugin. We plan to hook it up to Nagios, but haven't done so yet.
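
The plugin is nothing fancy; conceptually it is along these lines (a sketch, not the real thing, and the graph labels and field-name munging are only illustrative):

#!/usr/bin/env python
# Sketch of a Munin plugin that graphs queue depths using
# 'rabbitmqctl list_queues'. Not our actual plugin.
import re
import subprocess
import sys


def list_queues():
    """Return (queue_name, message_count) pairs from rabbitmqctl."""
    proc = subprocess.Popen(["rabbitmqctl", "list_queues", "name", "messages"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    queues = []
    for line in out.decode("utf-8", "replace").splitlines():
        parts = line.split("\t")
        # Skip the "Listing queues ..." header/footer lines.
        if len(parts) == 2 and parts[1].isdigit():
            queues.append((parts[0], int(parts[1])))
    return queues


def field(name):
    """Munin field names are restricted, so replace anything awkward."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


if __name__ == "__main__":
    queues = list_queues()
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        print("graph_title RabbitMQ queue lengths")
        print("graph_vlabel messages")
        for name, _ in queues:
            print("%s.label %s" % (field(name), name))
    else:
        for name, count in queues:
            print("%s.value %d" % (field(name), count))
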
Hope this is useful information.

Joe