[rabbitmq-discuss] Colocation Ignored, Cluster Resources Moved and Failing

Fri Sep 9 18:13:11 BST 2011

I am trying to set up a cluster of machines to provide an HA RabbitMQ solution.

I have been following the instructions on this page...

http://www.rabbitmq.com/pacemaker.html

...to no avail.

As of this writing, I have narrowed down the problems I'm having.
Either my cluster is not configured correctly, or it is ignoring the
colocation directives.  It is also possible that I simply don't
understand how to set up colocation, or the cluster in general.

Symptoms...

I have a corosync/pacemaker cluster running on one node.  When I start
the second node in the cluster, the cluster attempts to start the
resource 'bunny1' on the second node, despite the colocation
directives.  The startup attempt on the second node fails because the
resources that 'bunny1' depends on ('bunny1_fs' and 'bunny1_ip') are
not running there.  So not only is the colocation directive being
ignored, so is the order directive, which should tell the cluster the
order in which to start things.  When this failure occurs, the only
recourse I've found for getting the cluster back into working order is
to kill -9 corosync on the second node.

I'm unclear what information will be useful in fixing this, but I
assume the cluster config is required...

node rabbitmq3.colo.bluestatedigital.com
node rabbitmq4.colo.bluestatedigital.com
primitive bunny1 ocf:rabbitmq:rabbitmq-server \
       params mnesia_base="/mnt/bunny1/mnesia" ip="10.0.210.35" \
              nodename="rabbit@riak1" \
       meta target-role="Started"
primitive bunny1_drbd ocf:linbit:drbd \
       params drbd_resource="r0" \
       op monitor interval="10s" role="Master" \
       op monitor interval="30s" role="Slave" \
       op start interval="0" timeout="240s" \
       op stop interval="0" timeout="100s"
primitive bunny1_fs ocf:heartbeat:Filesystem \
       params device="/dev/drbd0" directory="/mnt/bunny1" fstype="ext3" \
       op start interval="0" timeout="60s" \
       op stop interval="0" timeout="60s" \
       meta target-role="Started"
primitive bunny1_ip ocf:heartbeat:IPaddr2 \
       params ip="10.0.210.35" cidr_netmask="16"
ms bunny1_drbd_ms bunny1_drbd \
       meta master-max="1" master-node-max="1" clone-max="2" \
              clone-node-max="1" notify="true" target-role="Master"
colocation bunny1_colo1 inf: bunny1 bunny1_ip
colocation bunny1_colo2 inf: bunny1 bunny1_fs
colocation bunny1_fs_colo inf: bunny1_fs bunny1_drbd_ms:Master
order bunny1_fs_order inf: bunny1_drbd_ms:promote bunny1_fs:start
order bunny1_order inf: bunny1_fs bunny1_ip bunny1
property $id="cib-bootstrap-options" \
       dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
       cluster-infrastructure="openais" \
       expected-quorum-votes="2" \
       stonith-enabled="false" \
       no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
       resource-stickiness="0"
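
For comparison, a variation along the following lines might express the
same intent with a resource group in place of the separate bunny1_colo1,
bunny1_colo2 and bunny1_order constraints.  As I understand it, a group
implies both colocation and start order among its members.  This is only
a sketch: the names bunny1_group, bunny1_group_on_drbd and
bunny1_group_after_drbd are made up for illustration and are not part of
my running config, and I have not verified that this changes the
behaviour described above.

```
# Hypothetical sketch, not part of the running configuration.
# Members of a group are kept on one node and started in the listed order.
group bunny1_group bunny1_fs bunny1_ip bunny1
# Place the whole group on the node where DRBD is Master.
colocation bunny1_group_on_drbd inf: bunny1_group bunny1_drbd_ms:Master
# Start the group only after DRBD has been promoted.
order bunny1_group_after_drbd inf: bunny1_drbd_ms:promote bunny1_group:start
```

With something like this in place, the existing bunny1_fs_colo and
bunny1_fs_order constraints would presumably become redundant, since
bunny1_fs would be covered through the group.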

I have tried several variations of the colocation commands, and none
have worked.  I have tried to find the answer via Google and in the
many docs, wiki pages, and forums, again to no avail.  I've also
looked through the log files for the cluster, but they contain so
much info that I can't tell whether I'm missing a relevant tidbit or
whether it just isn't there.

Any help will be very much appreciated.