[rabbitmq-discuss] Cluster fails when current primary node shuts down.

Thu Mar 15 19:57:28 GMT 2012

We're running RabbitMQ 2.7.1 on CentOS 5.6 with Erlang R14B in a
clustered configuration with mirrored queues.

Both nodes in the cluster are configured as disc nodes.

Yesterday, we attempted to force a failover of the cluster from
machineA to machineB, by doing a

  service rabbitmq-server stop

on machineA.

When this completed on machineA, instead of the cluster failing over,
the RabbitMQ broker on machineB died.  What we then noticed was that
when we tried to start machineB's rabbitmq-server, it would fail
during startup.  machineB would only ever start up if machineA's
rabbitmq-server was started first.

Note:  this is ONLY happening on our production cluster; I can't seem
to reproduce it in our QA environment.  I suspect something is off in
the cluster config in production.

Has anyone else seen this?  Is "service rabbitmq-server stop"
sufficient to cause a safe failover, or is there a preferred way?

Unfortunately, I don't have the log messages from machineB's
rabbitmq-server because it appears that they get overwritten upon
subsequent restarts of rabbitmq. :-(
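
For the next attempt, something like the following could be run on
machineB before each restart so the logs survive.  This is only a
minimal sketch: the function name snapshot_logs and the backup
location are made up for illustration, and /var/log/rabbitmq is an
assumption about where the logs live (adjust if yours differ).

```shell
#!/bin/sh
# Copy a log directory to a timestamped snapshot so that a later
# restart cannot overwrite it. Prints the snapshot path on success.
# snapshot_logs is a hypothetical helper, not part of RabbitMQ.
snapshot_logs() {
    src=$1         # log directory to preserve (assumed: /var/log/rabbitmq)
    dest_root=$2   # directory that will hold the snapshots (hypothetical)
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dest_root/rabbitmq-logs-$stamp"

    mkdir -p "$dest_root" || return 1
    # -p keeps timestamps and permissions, -R copies the whole directory
    cp -pR "$src" "$dest" || return 1
    printf '%s\n' "$dest"
}

# Example use (paths are assumptions):
#   snapshot_logs /var/log/rabbitmq /root/rabbitmq-log-snapshots
#   service rabbitmq-server start
```

Taking the snapshot before the start attempt, rather than after, means
the copy holds whatever the previous failed startup left behind.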

Travis
-- 
Travis Campbell
travis at ghostar.org