[rabbitmq-discuss] Problems With OCF Script - Race between start and wait

Fri May 13 12:46:36 BST 2011

On Mon, May 09, 2011 at 11:23:28PM +0000, Chris Chew wrote:
> Yes, I am seeing that 5 second wait behavior from rabbitmqctl, it's
> just that apparently 5 seconds is not long enough for the rabbit node
> to boot when using this shared volume as the mnesia_base.  I'm
> wondering if the rabbit_start() function should have a "spinner" built
> in just like the rabbit_stop() function has.

I think we should probably just wait indefinitely, provided the PID that
we get back from setsid is still alive (i.e. kill -0 $PID doesn't
error). Magic numbers are evil, and bumping 5 to 30 is no less evil.
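
Roughly what I have in mind, as a sketch rather than the eventual patch.
The name wait_for_start and its arguments are made up for illustration,
and it assumes $! from the backgrounded setsid still identifies the
process we care about:

```shell
#!/bin/sh
# Sketch only: wait_for_start is a hypothetical helper, not part of the
# shipped OCF script.
#
# wait_for_start PID CHECK_CMD...
#   Polls CHECK_CMD for as long as PID is alive.
#   Returns 0 as soon as CHECK_CMD succeeds, 1 if PID dies first.
#   There is deliberately no timeout here: the caller (the CRM/LRM) is
#   expected to abort us if startup takes too long.
wait_for_start() {
    pid=$1      # note: plain sh has no "local", so this is a global
    shift
    # kill -0 sends no signal; it only reports whether the PID exists.
    while kill -0 "$pid" 2>/dev/null; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Hypothetical use inside rabbit_start(), after the setsid line:
#   wait_for_start $! <whatever status check rabbit_wait performs>
```

The loop has two exits only: the check passes, or the process is gone.
How long is "too long" is left entirely to whoever called us.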

> I can "fix" the problem by simply adding a `sleep 2` inside the rabbit_start() function like this:
> 
> ========================================================================
> rabbit_start() {
>     ...
>     setsid sh -c "$RABBITMQ_SERVER > ${RABBITMQ_LOG_BASE}/startup_log 2> ${RABBITMQ_LOG_BASE}/startup_err" &
> 
>     # Wait for the server to come up.
>     # Let the CRM/LRM time us out if required
> 
>     sleep 2  #Chris: Just like Pete Rose, we want rabbitmqctl to lose the race between it and the rabbit-server vm startup...
> 
>     rabbit_wait
>     ...
> ========================================================================
> 
> ...of course the "sleep <arbitrary amount of time>" is not a proper
> fix.  I noticed the rabbit_stop() function has a "spinner" that waits
> for rabbit to shut down, deferring to the cluster stack to abort the
> process upon timeout...does it make sense to add the same type of wait
> loop for rabbit_start() and push the concern of the timeout into the
> cluster stack?

Yes, and OCF is very much in favour of that. But I think OCF can enforce
its timeout even if the script that boots Rabbit is itself blocked
waiting for it to come up.
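
To illustrate the point, here is a small sketch in which GNU coreutils
`timeout` stands in for the cluster stack. That substitution is an
assumption made purely for the demo: the LRM has its own mechanism, and
run_with_deadline is a made-up name. The command being run blocks and
knows nothing about any deadline, yet the caller still cuts it off:

```shell
#!/bin/sh
# Sketch only: GNU `timeout` plays the role of the cluster stack here.
#
# run_with_deadline SECONDS CMD...
#   Runs CMD and kills it if it has not finished after SECONDS.
run_with_deadline() {
    secs=$1
    shift
    timeout "$secs" "$@"
}

# "sleep 30" stands in for a start action that blocks waiting for the
# server; it contains no timeout logic of its own.
rc=0
run_with_deadline 1 sleep 30 || rc=$?
echo "exit status: $rc"   # GNU timeout reports 124 when the deadline hits
```

So the blocking script does not need a magic number of its own; the
deadline lives with the caller.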

Thanks for the details. I'll file a bug.

Matthew