[rabbitmq-discuss] Problems With OCF Script - Race between start and wait

Chris Chew chrisch at ecollege.com
Fri May 6 17:27:40 BST 2011


Hi Everybody.

Doing some testing on 2.4.1 towards upgrading from 2.2, I've noticed an odd behavior when using the OCF scripts to start Rabbit ala the Pacemaker guide.

There seems to be a race condition in the time I call `rabbitmq-server...` and then call `rabbitmqctl -n <host> wait`:  If the erlang node hasn't started up yet, then `rabbitmqctl -n <host> wait` will fail saying the node is down.  This failure causes the OCF process to fail and then falsely report that Rabbit is not running (rabbit still starts up just fine).

I fixed this temporarily on my installation by adding `sleep 2` in the rabbit_start() method of the rabbitmq-server OCF script between the lines where the server is started and the rabbit_wait function is invoked.

Interesting, and this through off the path for a while, the race condition is only lost when I install the management plugin && set mnesia_base to a shared (network-attached) volume.  Just one or the other is not enough to lose the race, but both are.  This might also helps explain why this might not have been seen before.

Anyways...does this seem plausible or am I just watching shadows on a cave wall, as it were?

Thanks!

Chris Chew


More information about the rabbitmq-discuss mailing list