[rabbitmq-discuss] [PATCH 00 of 10] Several improvements to the OCF resource agent

Tue May 11 20:26:00 BST 2010

On 05/11/2010 09:04 PM, Matthew Sackman wrote:
> On Tue, May 11, 2010 at 08:50:19PM +0200, Florian Haas wrote:
>> On 05/11/2010 08:01 PM, Matthew Sackman wrote:
>>> These are excellent, and I have no doubt they will likely all be
>>> accepted. As I'm sure you've been able to gather, some of the
>>> documentation and example scripts that I've read in order to be able to
>>> write the OCF script are out of date themselves, hence some of the
>>> issues you've spotted and corrected.
>>
>> Would you mind sharing exactly what documentation and example scripts
>> you were following? We should get those fixed.
> 
> Sure, a lot of it was just done from reading other OCF scripts, such as
> the DRBD, IPAddr and such like. But I think I got the most out of the
> docs at
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-ocf.html
> 
> For example, that only talks of the monitor action, not status, and
> nowhere that I found is there any real documentation as to what the
> functions in the library supplied with pacemaker do, nor how or when
> they should be used (e.g. the ocf_is_probe function you've used).

Those are not part of Pacemaker; they ship with the Linux-HA resource
agents package. The relevant function library is in
$OCF_ROOT/resource.d/heartbeat/.ocf-shellfuncs.

>> One other thing that came to mind while looking at the RA: the
>> recommended minimum start timeout of 600s seems a bit excessive.
>> Starting with Pacemaker 1.0.8 the crm shell will warn if the
>> configuration provides for shorter timeouts than the RA recommends. Sure
>> you need a 10-minute start timeout?
> 
> Currently, startup time can be very long if you have an awful lot of
> data to recover from disk. We think this might be partially fixed very
> soon, as some work that's recently been done will allow Rabbit to come
> up even before all recovery is complete (only the queues still being
> recovered will continue to be unavailable). However, even in this case,
> there are still some internal resources that must be fully recovered
> before Rabbit can be in any way considered to be up, and that can be
> proportional to the amount of data it has previously stored on disk.
> 
> Thus, in conclusion, yes, 10 mins may be far too long. But in some cases
> it may also be too short. Any advice you have as to what we should be
> doing wrt the OCF script would be gratefully received.

There's a misconception in play here, I'm afraid. What you define as the
timeout in the RA metadata is just a recommendation for the minimum
timeout the cluster administrator should configure. It doesn't actually
set or enforce that timeout; it merely recommends one. Long story short:
you don't want to set this piece of metadata to the maximum time you'd
expect startup to take, but to the time that most real-world
configurations will take to start, plus some 20-30% extra.
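
As a rough illustration of that rule of thumb, here is a minimal Python
sketch of the arithmetic. The function name, the sample start times, and
the use of the median as a stand-in for "most real-world configurations"
are all assumptions made for the example; nothing here is part of the
OCF spec or of any resource agent.

```python
import math
import statistics


def recommended_start_timeout(observed_start_seconds, headroom=0.25):
    """Suggest a minimum start timeout in whole seconds.

    observed_start_seconds: start durations measured on typical
        deployments (hypothetical input, not something the RA provides).
    headroom: extra fraction added on top of the typical start time;
        0.20 to 0.30 matches the rule of thumb in the text.
    """
    if not observed_start_seconds:
        raise ValueError("need at least one observed start time")
    # Assumption: the median approximates what most configurations need,
    # so a rare very slow start does not inflate the recommendation.
    typical = statistics.median(observed_start_seconds)
    return math.ceil(typical * (1 + headroom))


# Made-up sample: most starts take about a minute, one takes ten minutes.
samples = [40, 55, 60, 70, 600]
print(recommended_start_timeout(samples))  # prints 75
```

The ten-minute outlier does not drive the result, which is the point: an
administrator who knows the node will hit that worst case can still
configure a longer timeout than the RA recommends.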

MySQL (with InnoDB recovery) has very similar issues when failing over
hard on DRBD or shared storage. The recommended minimum start timeout
there is 120s.

Cheers,
Florian
