[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.
Mark Ward
ward.mark at gmail.com
Tue Oct 30 14:45:21 GMT 2012
I am testing how a RabbitMQ cluster behaves when things go wrong, so that
we have an idea of what to expect. This thread is to ask about one result
that was unexpected.
My basic understanding of RabbitMQ clustering comes from what I have read
at http://www.rabbitmq.com/ha.html and what I have experienced in my testing.
My test scenario was the following:
What happens if all cluster nodes are shut down at the same time while
holding mirrored, persisted data? No clients were attached to the cluster
at the time.
I expected that when the nodes booted up they would all come back online,
work out which node should be the master for the queue, and not lose any
data.
What I experienced was that each server booted, but RabbitMQ failed to
start on every cluster server, reporting an error and writing an
"erl_crash.dump". The cluster was dead on startup. Knowing that RabbitMQ
needs to negotiate with the rest of the cluster to determine the state of
the queue, I prepared each server to start RabbitMQ and then quickly
started the RabbitMQ service on each one. This gave the nodes time to find
each other, and the cluster is now back online. The queue is online with
the expected 101 messages, but it is currently not a synchronized mirror.
Only one node has the queue and its data. The other two nodes participate
in the mirror but are not synchronized with the existing data.
This is how the test was performed. The cluster has 3 servers. Each node
is a VM guest, and all 3 guests run on a single host. I hard stopped the
host, which brought down every guest at once (preventing the RabbitMQ
cluster from negotiating masters or sending shutdown notifications). I then
restarted the host and restarted each guest at the same time.
What I am wondering is: what is the best way to bring a cluster back online
after something like this? Basically, the scenario is that a RabbitMQ
cluster is found offline with all servers off, and you have to bring the
cluster back without losing data in the persisted queues. How would you go
about doing this? With an idle cluster it might be easier, but I bet it
would be much harder with live clients trying to connect and ready to use
any node as soon as it comes online.
Another question is how to have RabbitMQ recover from a crash like this in
a better way than having to race through all of the servers, starting each
node by hand.
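
For illustration only, the manual race described above could be scripted
so that every node is started at roughly the same moment. This is a minimal
sketch, not a recommended recovery procedure. It assumes SSH access to each
node and a Linux service named rabbitmq-server; the host names and the
helper names (start_command, start_all) are hypothetical and would need to
be adapted, for example for a Windows service.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical host names; replace with the real cluster nodes.
NODES = ["rabbit1", "rabbit2", "rabbit3"]


def start_command(host):
    # Assumption: SSH access and a Linux service called rabbitmq-server.
    return ["ssh", host, "sudo", "service", "rabbitmq-server", "start"]


def run(cmd):
    # Run one command and return its exit code (0 means success).
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def start_all(nodes, runner=run):
    """Issue the start command to every node concurrently.

    Returns a dict mapping each host to the exit code of its start command.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        codes = list(pool.map(lambda host: runner(start_command(host)), nodes))
    return dict(zip(nodes, codes))
```

Calling start_all(NODES) sends the start command to all three nodes in
parallel instead of one after another. A non-zero exit code for a host only
means its start command failed; the sketch does not check whether the
mirrors have synchronized afterwards.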
-Mark