[rabbitmq-discuss] Problems With OCF Script - Race between start and wait

Mon May 9 16:54:34 BST 2011

Hi Chris,

On Fri, May 06, 2011 at 04:27:40PM +0000, Chris Chew wrote:
> Doing some testing on 2.4.1 towards upgrading from 2.2, I've noticed an odd behavior when using the OCF scripts to start Rabbit ala the Pacemaker guide.
> 
> There seems to be a race condition between the time I call `rabbitmq-server...` and the time I call `rabbitmqctl -n <host> wait`: if the Erlang node hasn't started up yet, then `rabbitmqctl -n <host> wait` will fail saying the node is down.  This failure causes the OCF process to fail and then falsely report that Rabbit is not running (Rabbit still starts up just fine).
> 
> I fixed this temporarily on my installation by adding `sleep 2` in the rabbit_start() method of the rabbitmq-server OCF script between the lines where the server is started and the rabbit_wait function is invoked.
> 
> Interestingly, and this threw me off the path for a while, the race is only lost when I install the management plugin and also set mnesia_base to a shared (network-attached) volume.  Just one or the other is not enough to lose the race, but both together are.  This might also help explain why this has not been seen before.

Having thought about this some more, I'm now puzzled. `rabbitmqctl -n
<host> wait` should wait for a minimum of 5 seconds before giving up on
the host. This should be enough time for the Erlang node to start (even
if not to fully boot). As long as the Erlang node is contactable in some
form within those 5 seconds, the wait will then block indefinitely until
the node has fully booted, and then report whether Rabbit managed to
start up OK or not.
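
To make the behaviour described above concrete, here is a rough shell
sketch of the two phases. It is only an illustration, not the actual
rabbitmqctl code: the function names, the PROBE_DIR variable and the
marker files standing in for "node contactable" and "rabbit booted" are
all assumptions made up for this example.

```shell
#!/bin/sh
# Illustrative model of the two-phase wait. NOT the real rabbitmqctl logic.

# Hypothetical probe: succeeds once the Erlang node answers at all.
# A marker file stands in for a real connection attempt.
node_contactable() { [ -e "${PROBE_DIR:-/tmp}/node_up" ]; }

# Hypothetical probe: succeeds once the rabbit application has fully booted.
node_booted() { [ -e "${PROBE_DIR:-/tmp}/rabbit_booted" ]; }

two_phase_wait() {
    grace=${1:-5}   # seconds allowed for the node to become contactable
    waited=0

    # Phase 1: bounded wait. Give up if the node never becomes contactable.
    until node_contactable; do
        if [ "$waited" -ge "$grace" ]; then
            echo "node down"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # Phase 2: unbounded wait. The node is up, so keep polling until boot
    # completes (simplified: a failed boot is not modelled here).
    until node_booted; do
        sleep 1
    done
    echo "running"
    return 0
}
```

Under this model, a start that is slower than the grace period fails in
phase 1, whereas a node that is reachable early simply moves on to the
open-ended phase 2.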

What you're reporting seems to be the inverse: i.e. if the server starts
too quickly then you encounter a problem.

Could you clarify?

Matthew