[rabbitmq-discuss] Cluster fails when current primary node shuts down.

Travis hcoyote at ghostar.org
Thu Apr 5 00:37:44 BST 2012


On Mon, Mar 19, 2012 at 5:44 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
> On 15/03/12 19:57, Travis wrote:
>>
>> When this completed on machineA, instead of the cluster failing over,
>> the rabbitmq on machineB died.  What we then noticed was that when we
>> tried to start up machineB's rabbitmq-server, it would fail the
>> startup process.  machineB would only ever start up if machineA's
>> rabbitmq-server was started first.
>>
>> note:  this is ONLY happening on our production cluster; I can't seem
>> to reproduce it in our QA environment.  I suspect something is whacked
>> in cluster config in production.
>
>
> OK, that's alarming.
>

OK, so I've gotten another chance to test this on our production
server.  What appears to be happening is that when we initiate the
stop on machineA, RabbitMQ *looks* like it fails the mirrored queues
over to machineB, but as soon as this completes, the management plugin
dies on machineB.  We can confirm that the queues are alive because
our applications consuming from the queues continue chugging along
happily, processing messages that have been published.

rabbitmqctl cluster_status also shows that the instance on machineB is alive.
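For concreteness, the checks I'm running look roughly like this (the
node names are placeholders, and I'm not 100% sure the pid/slave_pids
queue info items are available in the release we're on):

  # confirm machineB still thinks the cluster is intact
  rabbitmqctl -n rabbit@machineB cluster_status

  # confirm the mirrored queues have a live master process on machineB
  rabbitmqctl -n rabbit@machineB list_queues name pid slave_pids

  # check whether rabbitmq_management is still listed under
  # running_applications
  rabbitmqctl -n rabbit@machineB status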

Oddly, if I've failed over from machineA to machineB, done a
stop_app/start_app to get the management plugin working on machineB,
brought machineA back up, and then initiated the failover back to
machineA, that failover completes without issue (i.e., the management
plugin has no failures).

So ...

1)  As I understand it, the node that the management interface
associates with the stats service is the node considered the master.
How can I tell which node the stats service is running on if the
management plugin is not running?
2) How do I restart the management plugin and its associated parts
without doing a full stop_app/start_app?  (Rough sketch for 1 and 2
below.)
3) What else would help debug this further?  I can reliably reproduce
the issue when failing over from machineA to machineB.

Note: on the off chance that there was some cruft in machineB's
on-disk Mnesia state, I proactively pulled machineB out of the
cluster, reset it, and added it back in as the second disc node.
This does not appear to have made any appreciable difference in
behavior.
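Roughly, the steps were (from memory, using the old-style cluster
command, so the exact syntax may be slightly off):

  # on machineB: leave the cluster, wipe local state, rejoin as a disc node
  rabbitmqctl -n rabbit@machineB stop_app
  rabbitmqctl -n rabbit@machineB reset
  rabbitmqctl -n rabbit@machineB cluster rabbit@machineA rabbit@machineB
  rabbitmqctl -n rabbit@machineB start_app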

Travis
-- 
Travis Campbell
travis at ghostar.org

