[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.

Tue Oct 30 16:49:37 GMT 2012

I am not sure quite what you are saying. You say that when you started 
the nodes again, none of them successfully started? And there was "an 
error". But then you started them "quickly" and that worked?

When each node is started it decides whether it thinks there are any 
other nodes which were running when it was killed. If so it waits 30 
seconds for them to become available and if nothing appears gives an 
error about "timeout waiting for tables".

Was this the error you saw?
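
To make the startup behaviour described above concrete, here is a rough
Python sketch of the decision each node makes. It is only a model, not
RabbitMQ's actual code: the function name, the `is_available` callback,
the polling interval and the return values are all invented for
illustration, and treating a single reachable peer as sufficient is a
simplifying assumption. Only the 30 second wait and the error text come
from the description above.

```python
import time


def wait_for_peers(peers_running_at_shutdown, is_available,
                   timeout=30.0, poll=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Model of the boot-time wait (hypothetical helper, not a RabbitMQ API).

    peers_running_at_shutdown: nodes this node believes were still running
        when it went down.
    is_available: callback returning True once a given peer can be reached.
    """
    # No peer was running when this node stopped, so there is nobody to
    # wait for and the node can start on its own.
    if not peers_running_at_shutdown:
        return "start_alone"

    deadline = clock() + timeout
    while True:
        for peer in peers_running_at_shutdown:
            if is_available(peer):
                # Assumption: one reachable peer is enough to proceed.
                return "joined:" + peer
        if clock() >= deadline:
            # Nothing appeared within the window: give up with the error.
            raise TimeoutError("timeout waiting for tables")
        sleep(poll)
```

In this model, a hard stop of all three nodes leaves every node
believing the other two were still running, so each one waits. If the
others are started inside the window the call returns; if not, every
node ends with the timeout error.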

We might make this 30 seconds configurable in future, but we need to 
think of the other case (where people start one node and not the other, 
and don't realise anything is wrong until the timeout).

You should also read:
http://www.rabbitmq.com/ha.html#unsynchronised-slaves

Cheers, Simon

On 30/10/12 14:45, Mark Ward wrote:
> I am testing what happens when bad things happen to a RabbitMQ cluster
> so that we have an idea what to expect.  This thread is to ask about
> something unexpected.
>
> My basic understanding of the RabbitMQ cluster is what I have read at
> http://www.rabbitmq.com/ha.html and experienced in my testing.
>
> My testing scenario was the following.
> What happens if all cluster nodes were shut down at the same time with
> mirrored persisted data?  No clients were attached to the cluster at
> this time.
> What I was expecting is when the nodes were booted up they would all
> come back online and figure out what they needed for the master of the
> queue and not lose any data.
> What I experienced was each server booted up but RabbitMQ failed to
> start on every cluster server and issued an error plus an
> "erl_crash.dump".  The cluster was dead upon start up.  Knowing that
> RabbitMQ needs to negotiate with the cluster to determine the state of
> the queue, I prepared each server to start RabbitMQ.  I then quickly
> started the RabbitMQ service on each server.  This allowed the nodes time
> to find each other and the cluster is back online.   The queue is online
> with the expected 101 messages but is currently not a synchronized
> mirror.  Only one node has the queue and the data. The other two nodes
> support the mirror but are not synchronized with the existing data.
>
> This is how the test was performed.  It is a 3 server cluster. Each node
> is a VM guest, and all 3 guests run on a single host.  I hard stopped
> the host, which brought down each guest (preventing the RabbitMQ cluster
> negotiation of masters and notifications of shutdowns).  I then restarted
> the host and restarted each guest at the same time.
>
> What I am wondering is: what is the best way to bring a cluster back
> online after something like this?  Basically, the scenario is that a
> RabbitMQ cluster is found offline.  All servers are off.  You have to
> bring the cluster back online without data loss to the persisted queues.
> How would you go about doing this?  It might be easier with an idle
> cluster, but if live clients are trying to connect and are ready to use
> any node that comes online, I bet it would be much harder.
> Another question is how to have RabbitMQ come back online from a crash
> like this more gracefully than having to race through all of the
> servers, starting each node.
>
> -Mark

-- 
Simon MacMullen
RabbitMQ, VMware