[rabbitmq-discuss] Cluster fails when current primary node shuts down.
simon at rabbitmq.com
Tue Apr 10 12:21:05 BST 2012
On 05/04/2012 12:37AM, Travis wrote:
> Ok, so I've gotten another chance to test this on our production
> server. What appears to be happening is that, when we initiate the
> stop on machineA, rabbitmq *looks* like it fails the mirrored queues
> over to machineB, but as soon as this completes, the management plugin
> dies on machineB. We can confirm that the queues are alive because
> our applications consuming from the queue continue chugging along
> happily, processing messages that have been published.
Oh, that's odd, and definitely should not be happening.
> rabbitmqctl cluster_status also shows that the instance on machineB is alive.
> Oddly, if I've failed from machineA to machineB, done a
> stop_app/start_app to get the management plugin working on machineB,
> brought up machineA again, and then initiated the failover back to
> machineA, the failover completes without issue (e.g., the management
> plugin has no failures).
> So ...
> 1) As I understand it, the node in the management interface
> associated with the stats service is the node considered the master.
> How can I tell which node has the stats service running if the
> management plugin is not running?
The stats service is part of the management plugin, so if mgmt is not
running it isn't either. There's no inherent "master" node in a RabbitMQ
cluster; the node with stats is just the node with stats (and this
doesn't have to mean anything with respect to where mirrored queues have
> 2) How do I restart the management plugin and associated parts without
> doing a full stop_app/start_app?
You should be able to do:
    rabbitmqctl eval 'application:stop(rabbitmq_management).'
    rabbitmqctl eval 'application:start(rabbitmq_management).'
to just restart the management plugin.
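
If you end up doing this repeatedly while reproducing the problem, the two commands above could be wrapped in a small shell function. This is only a sketch: the function name restart_mgmt and the RABBITMQCTL override variable are made up for illustration, and it assumes rabbitmqctl is on the PATH of a user allowed to talk to the local node.

```shell
#!/bin/sh
# Sketch: restart only the management plugin on the local node,
# leaving the broker (and its queues) running.
#
# RABBITMQCTL is a hypothetical override, not something rabbitmqctl
# itself reads; it defaults to the rabbitmqctl found on PATH.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

restart_mgmt() {
    # Stop the plugin's application. If this step exits non-zero,
    # carry on anyway so the start is still attempted.
    "$RABBITMQCTL" eval 'application:stop(rabbitmq_management).' || true

    # Start it again. The function's exit status is the status of this call.
    "$RABBITMQCTL" eval 'application:start(rabbitmq_management).'
}
```

Running restart_mgmt on machineB after the failover would then take the place of the full stop_app/start_app. rabbitmqctl eval prints the Erlang return value of each call, so it is worth reading that output rather than relying on the exit status alone.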
> 3) What other things would be needed to debug this further? I can
> reliably cause this issue to occur failing from machineA to machineB.
Logs (including the "sasl" log) from when the failure occurs. If you can
reliably reproduce this that's great, I would like to get to the bottom