[rabbitmq-discuss] Problems With OCF Script - Race between start and wait

Mon May 9 16:47:34 BST 2011

Hi Chris,

On Fri, May 06, 2011 at 04:27:40PM +0000, Chris Chew wrote:
> Doing some testing on 2.4.1 towards upgrading from 2.2, I've noticed an odd behavior when using the OCF scripts to start Rabbit ala the Pacemaker guide.
> 
> There seems to be a race condition between the time I call `rabbitmq-server...` and the time I call `rabbitmqctl -n <host> wait`:  If the Erlang node hasn't started up yet, then `rabbitmqctl -n <host> wait` will fail saying the node is down.  This failure causes the OCF process to fail and then falsely report that Rabbit is not running (Rabbit still starts up just fine).
> 
> I fixed this temporarily on my installation by adding `sleep 2` in the rabbit_start() method of the rabbitmq-server OCF script between the lines where the server is started and the rabbit_wait function is invoked.
> 
> Interestingly, and this threw me off the path for a while, the race is only lost when I both install the management plugin and set mnesia_base to a shared (network-attached) volume.  Either one alone is not enough to lose the race, but both together are.  This might also help explain why this has not been seen before.
> 
> Anyways...does this seem plausible or am I just watching shadows on a cave wall, as it were?

Yeah, that does seem rather likely actually, and I'm annoyed I managed
to miss this possibility in the great rewrite of starting rabbit that
happened late last year - the relevant bug was reopened dozens of times
as more and more corrections were found, but it looks like this got
through.

I guess the script really needs to look at the pid once the fork has
happened and work off that, rather than off whether or not the Erlang
node is contactable. Sigh, I'll open a bug...
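
As a rough illustration of the pid-based idea, here is a minimal sketch in
POSIX shell. It is not the actual OCF script or the eventual fix: the
function name `wait_for_pid`, the pid file argument, and the default timeout
are all hypothetical, and it assumes the server writes its pid to a file
that the script can read.

```shell
#!/bin/sh
# Hypothetical helper (not part of the rabbitmq-server OCF script):
# poll a pid file until it names a live process, or give up after a timeout.
# Usage: wait_for_pid PIDFILE [TIMEOUT_SECONDS]
wait_for_pid() {
    pidfile=$1
    timeout=${2:-30}        # assumed default; tune for slow shared volumes
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s: the file exists and is non-empty
        if [ -s "$pidfile" ]; then
            pid=$(cat "$pidfile")
            # kill -0 sends no signal; it only checks that the process exists
            # (it also fails if the caller may not signal that process).
            if kill -0 "$pid" 2>/dev/null; then
                return 0
            fi
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A start function could call something like `wait_for_pid /path/to/rabbitmq.pid 30`
(the path is a placeholder) between launching the server and the existing
wait step, in place of a fixed `sleep 2`. That only closes the window in
which the process has not yet been forked; whether the node is ready to
serve is still a separate check.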

Thanks for the report though.

Matthew