[rabbitmq-discuss] rabbitmqctl start_app hangs when replacing mirrored cluster instances in EC2

Mike Zraly mzraly at gmail.com
Mon Jul 7 12:30:40 BST 2014


[I tried posting this to the new group, rabbitmq-users, but got no 
response.  Google groups tells me rabbitmq-users only has 101 members now, 
compared to 1800 or so for rabbitmq-discuss, so I hope re-posting to the 
larger group will at least elicit some (non-meta) feedback.]

Hi all,

I am setting up a RabbitMQ cluster in an Amazon EC2 region.  Each host is 
in the same geographical region, so I do not expect network partitions in 
the sense that two members of the cluster are both running but cannot 
communicate with each other.  However, it is reasonable to expect individual 
cluster hosts to be terminated and replaced with new hosts having the same 
hostname but a new IP address and a fresh install of RabbitMQ.  A typical 
use case for this is a rolling upgrade where we keep 2 of the 3 cluster 
nodes up at all times to continue providing service during the upgrade 
period.

I hoped that the same post-install provisioning script that joins a 
newly created instance into the cluster would also work for a new instance 
that is taking over for an older one.  Instead, rabbitmqctl start_app 
hangs.

The installation sequence is basically this:

install rabbitmq-server_3.3.1-1
enable management plugin
add health check user account with monitoring tag
add application user account
add HA policy '{"ha-mode": "all", "ha-sync-mode": "automatic"}' for all 
application queues
service rabbitmq-server stop
set /var/lib/rabbitmq/.erlang.cookie
reboot system (restarting rabbitmq server)
for each hostname 'target' that this host should join into a cluster with:
    if target is listening on port 5672
        rabbitmqctl stop_app
        if rabbitmqctl join_cluster target has non-zero exit status
            rabbitmqctl start_app
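
For reference, the join loop at the end of that sequence might look roughly like the shell sketch below.  This is only an illustration, not the actual provisioning script: the names RABBITMQCTL, port_open and join_targets are made up for the example, the port check assumes nc is installed, and the node name is assumed to be the default rabbit@<hostname>.  The listing above shows start_app only in the failure branch; the sketch assumes the app should be started again after a join attempt either way, since a stopped node serves no traffic.

```shell
#!/bin/sh
# Hypothetical sketch of the join loop; names and helpers are assumptions.

# Allow the rabbitmqctl binary to be overridden (e.g. for a dry run).
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Return 0 if something is listening on the AMQP port (5672) of host $1.
# Assumes a netcat that supports -z (scan only) and -w (timeout in seconds).
port_open() {
    nc -z -w 5 "$1" 5672
}

# For each target host that is up: stop the local app, try to join the
# target's cluster, then start the local app again.
join_targets() {
    for target in "$@"; do
        if port_open "$target"; then
            "$RABBITMQCTL" stop_app
            # Assumption: the remote node uses the default name rabbit@<host>.
            if ! "$RABBITMQCTL" join_cluster "rabbit@$target"; then
                echo "join_cluster rabbit@$target failed" >&2
            fi
            # Assumption: start the app whether or not the join succeeded.
            "$RABBITMQCTL" start_app
        fi
    done
}
```

With hosts named A and B already in the cluster, the new host would call something like `join_targets A B`.  The last command in the loop body is start_app, which is the step that hangs in the scenario described below.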

If I start a cluster with hosts A, B, and C, then terminate instance C 
and replace it with a new instance that executes these same steps, 
rabbitmqctl join_cluster succeeds, saying C is already part of the 
cluster, and then rabbitmqctl start_app hangs.

What am I doing wrong?
