[rabbitmq-discuss] cluster meltdown

Tue Apr 1 05:10:14 BST 2014

We have a two-node cluster. One AWS node froze on us, and we could neither
start nor stop it. That made the first node unresponsive to mgmt db actions;
all API requests timed out. We restarted the first node, but a lot of queues
were then inaccessible:

=ERROR REPORT==== 1-Apr-2014::02:58:03 ===
connection <0.5559.279>, channel 1 - soft error:
{amqp_error,not_found,
            "home node 'rabbit@node2' of durable queue 'celery' in vhost
'vhost1' is down or inaccessible",
            'queue.declare'}

We issued rabbitmqctl forget_cluster_node rabbit@node2 on node1, as we still
couldn't access node2.

Node1 continued to report a lot of "home node of queue is down" errors.

Node2 has now restarted, but can't join the cluster. Is there a way to
rejoin the cluster without resetting? 

We reset node2 and tried join_cluster again, but with the following
result:
Clustering node 'rabbit@node2' with 'rabbit@node1' ...
...done (already_member).

node2# rabbitmqctl cluster_status

Cluster status of node 'rabbit@node2' ...
[{nodes,[{disc,['rabbit@node2']}]},
 {running_nodes,['rabbit@node2']},
 {partitions,[]}]
...done.

But start_app on node2 didn't make it join node1.

node1# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node1' ...
[{nodes,[{disc,['rabbit@node1','rabbit@node2']}]},
 {running_nodes,['rabbit@node1']},
 {partitions,[]}]
...done.

node2# rabbitmqctl update_cluster_nodes rabbit@node1

Now node2 understands that it's clustered with node1, and with start_app it
starts and joins node1.
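
For anyone hitting the same thing, the node2 side of what worked for us can be
sketched roughly as the script below. This is only a sketch under assumptions:
it is meant to run on node2, it stops the app first (as far as I know,
update_cluster_nodes wants the node's app stopped), and the PEER and DRY_RUN
variables and the run/rejoin_cluster helpers are names made up for this
illustration, not anything that ships with RabbitMQ. DRY_RUN defaults to 1, so
by default it only prints the commands.

```shell
#!/bin/sh
# Sketch: re-point a stopped node at a surviving cluster member.
# PEER and DRY_RUN are hypothetical knobs for this example only.
PEER="${PEER:-rabbit@node1}"   # node to consult for cluster membership
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands, 0 = actually run them

# Print or execute a single rabbitmqctl subcommand.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "rabbitmqctl $*"
    else
        rabbitmqctl "$@"
    fi
}

# Stop the app, refresh the cluster view from PEER, start again,
# then show the resulting cluster status. Stops at the first failure.
rejoin_cluster() {
    run stop_app &&
    run update_cluster_nodes "$PEER" &&
    run start_app &&
    run cluster_status
}

rejoin_cluster
```

With DRY_RUN=0 the same four commands are executed for real; check the
cluster_status output on both nodes afterwards to confirm that node2 shows up
under running_nodes.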

RabbitMQ 3.2.3, Erlang R16B03-1
