[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.

Tue Oct 30 14:45:21 GMT 2012

I am testing what happens when bad things happen to a RabbitMQ cluster so 
that we have an idea what to expect.  This thread is to ask about something 
unexpected.

My basic understanding of RabbitMQ clustering comes from what I have read 
at http://www.rabbitmq.com/ha.html and from what I have seen in my testing.

My test scenario was the following.
What happens if all cluster nodes are shut down at the same time while 
holding mirrored, persisted data?  No clients were attached to the cluster 
at the time.
What I expected was that when the nodes booted they would all come back 
online, work out which node should be the master of the queue, and not 
lose any data.
What I experienced was that each server booted, but RabbitMQ failed to 
start on every cluster server, issuing an error plus an "erl_crash.dump".  
The cluster was dead on startup.  Knowing that RabbitMQ needs to negotiate 
with the rest of the cluster to determine the state of the queue, I 
prepared each server to start RabbitMQ and then quickly started the 
RabbitMQ service on each one.  This gave the nodes time to find each other, 
and the cluster is now back online.  The queue is online with the expected 
101 messages but is currently not a synchronized mirror.  Only one node has 
the queue and the data.  The other two nodes are part of the mirror but are 
not synchronized with the existing data.
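
For anyone wanting to reproduce the "not synchronized" observation, the 
check could be scripted roughly as below.  This is only a sketch.  It 
assumes that "rabbitmqctl list_queues" accepts the slave_pids and 
synchronised_slave_pids columns on the version in use, that the output is 
tab-separated, and that each pid is printed between angle brackets.  The 
helper names are made up for illustration.

```python
import re
import subprocess

# Assumption: pids appear in the output as <...> tokens.
PID = re.compile(r"<[^>]+>")
COLUMNS = ["name", "messages", "slave_pids", "synchronised_slave_pids"]


def parse_queue_line(line):
    """Turn one tab-separated output line into a dict, or None if it is
    not a queue row (for example a banner or footer line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        return None
    name, messages, slaves, synced = fields
    if not messages.isdigit():
        return None
    return {
        "name": name,
        "messages": int(messages),
        "slaves": set(PID.findall(slaves)),
        "synced": set(PID.findall(synced)),
    }


def unsynchronised(lines):
    """Return {queue name: set of slave pids that are not synchronised}."""
    report = {}
    for line in lines:
        queue = parse_queue_line(line)
        if queue is None:
            continue
        lagging = queue["slaves"] - queue["synced"]
        if lagging:
            report[queue["name"]] = lagging
    return report


def check_cluster(run=subprocess.check_output):
    """Run rabbitmqctl on the local node and report lagging mirrors."""
    out = run(["rabbitmqctl", "list_queues"] + COLUMNS)
    if isinstance(out, bytes):
        out = out.decode()
    return unsynchronised(out.splitlines())
```

An empty result from check_cluster() would mean every listed mirror is 
reported as synchronised.  In the situation described above, the queue 
would be expected to show up with two lagging mirrors.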

This is how the test was performed.  The cluster has 3 servers.  Each node 
is a VM guest, and all 3 guests run on a single host.  I hard-stopped the 
host, which brought down every guest at once (preventing any RabbitMQ 
cluster negotiation of masters or notification of shutdowns).  I then 
restarted the host and restarted all the guests at the same time.

What I am wondering is: what is the best way to bring a cluster back online 
after something like this?  Basically, the scenario is that a RabbitMQ 
cluster is found offline.  All servers are off.  You have to bring the 
cluster back online without losing data in the persisted queues.  How would 
you go about doing this?  With an idle cluster it might be easier, but I bet 
it would be much harder with live clients trying to connect and ready to 
use any node as soon as it comes online.
Another question is how to get RabbitMQ to come back online from a crash 
like this in a better way than racing through all of the servers to start 
each node.
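
To make the "race through all of the servers" workaround concrete, it 
could be scripted along these lines.  This is a sketch of the manual 
workaround only, not a recommended recovery procedure.  The host names, 
passwordless ssh, sudo, and the "service rabbitmq-server start" command are 
all assumptions about the environment and are not details from this post.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical host names; substitute the real cluster members.
NODES = ["rabbit1", "rabbit2", "rabbit3"]


def start_command(host):
    """Build the ssh command that starts the RabbitMQ service on one host.

    Assumes passwordless ssh and a SysV-style "service" wrapper.
    """
    return ["ssh", host, "sudo", "service", "rabbitmq-server", "start"]


def start_all(nodes, run=subprocess.call):
    """Issue the start command to every node at roughly the same time.

    Returns a dict mapping host -> exit code of its start command.
    """
    # One worker per node so that no start waits behind another.
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        codes = list(pool.map(lambda host: run(start_command(host)), nodes))
    return dict(zip(nodes, codes))
```

Calling start_all(NODES) returns the exit code per host, so a non-zero 
value points at a node whose start command failed.  The run parameter is 
there only so the command runner can be swapped out for a dry run.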

-Mark
