[rabbitmq-discuss] Cluster fails when current primary node shuts down.

Tue Apr 10 12:21:05 BST 2012

On 05/04/2012 12:37AM, Travis wrote:
> Ok, so I've gotten another chance to test this on our production
> server.  What appears to be happening is that, when we initiate the
> stop on machineA, rabbitmq *looks* like it fails the mirrored queues
> over to machineB, but as soon as this completes, the management plugin
> dies on machineB.  We can confirm that the queues are alive because
> our applications consuming from the queue continue chugging along
> happily, processing messages that have been published.

Oh, that's odd, and definitely should not be happening.

> rabbitmqctl cluster_status also shows that the instance on machineB is alive.
>
> Oddly, if I've failed from machineA to machineB, done a
> stop_app/start_app to get the management plugin working on machineB,
> brought up machineA again, and then initiated the failover back to
> machineA, the failover completes without issue (e.g., the management
> plugin has no failures).
>
> So ...
>
> 1)  As I understand it, the node in the management interface
> associated with the stats service is the node considered the master.
> How can I tell which node has the stats service running if the
> management plugin is not running?

The stats service is part of the management plugin, so if mgmt is not 
running it isn't either. There's no inherent "master" node in a RabbitMQ 
cluster; the node with stats is just the node with stats (and this 
doesn't have to mean anything with respect to where mirrored queues have 
their masters).

> 2) How do I restart the management plugin and associated parts without
> doing a full stop_app/start_app?

You should be able to do:

rabbitmqctl eval 'application:stop(rabbitmq_management).'
rabbitmqctl eval 'application:start(rabbitmq_management).'

to just restart the management plugin.
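
If you end up doing this repeatedly while reproducing the problem, the two commands above could be wrapped in a small shell function. This is only a sketch: the function name restart_mgmt and the RABBITMQCTL override variable are made up for illustration (the override just lets you dry-run with echo), and it assumes rabbitmqctl is on the PATH of the user running it.

```shell
#!/bin/sh
# Sketch only: restart just the management plugin on the local node.
# RABBITMQCTL is a hypothetical override (not a RabbitMQ setting) so the
# function can be dry-run, e.g. with RABBITMQCTL=echo.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

restart_mgmt() {
    # Stop the plugin's OTP application. If the plugin has already died,
    # the printed result may be an error term; we carry on regardless.
    "$RABBITMQCTL" eval 'application:stop(rabbitmq_management).'
    # Start it again. The function's exit status is that of this call.
    "$RABBITMQCTL" eval 'application:start(rabbitmq_management).'
}
```

After sourcing the file, run restart_mgmt on the node whose management plugin has died (machineB in your case) and check the printed result of each eval.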

> 3) What other things would be needed to debug this further?  I can
> reliably cause this issue to occur failing from machineA to machineB.

Logs (including the "sasl" log) from when the failure occurs. If you can
reliably reproduce this, that's great; I would like to get to the bottom
of it.
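
One way to grab everything in one go is to tar up the node's log directory right after the failure. The sketch below assumes the logs live in /var/log/rabbitmq, which is an assumption about your install and may well differ; the function name collect_rabbit_logs and both arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: bundle a node's log directory (main log and "sasl" log)
# into one tarball. The default directory is an assumption; pass the
# real one as the first argument if your install differs.
collect_rabbit_logs() {
    log_dir="${1:-/var/log/rabbitmq}"
    out="${2:-rabbit-logs-$(hostname)-$(date +%Y%m%d%H%M%S).tar.gz}"
    # Refuse to continue if the directory does not exist.
    [ -d "$log_dir" ] || { echo "no such directory: $log_dir" >&2; return 1; }
    # Archive the directory contents and print the tarball path on success.
    tar -czf "$out" -C "$log_dir" . && echo "$out"
}
```

Running collect_rabbit_logs on machineB straight after the failover should capture both logs from the moment the plugin died.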

Cheers, Simon