[rabbitmq-discuss] Clustered startup with multiple queues and multiple masters

Francesco Mazzoli francesco at rabbitmq.com
Wed Jun 13 12:34:35 BST 2012


Another thing:

>   3) Reset it, force_cluster it to the disc node you brought up,
>      and then reset it again. This will make the disc node believe that

This is not necessary; a reset here should be enough, without having
to re-cluster the node. Sorry about the confusion.
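
Taking that into account, the whole procedure would boil down to
something like the following. This is only a rough, untested sketch:
it assumes the 2.8.x command names, the standard rabbitmq-server
startup script, and the node names from the listing below
(rabbit@play standing in for the disc node that starts cleanly in
step 1, rabbit@play2 for one of the other nodes):

  # 1) On play: start the disc node with the most up-to-date database
  rabbitmq-server -detached

  # 2) On play2: start the Erlang node only, without the rabbit app
  RABBITMQ_NODE_ONLY=true rabbitmq-server -detached

  # 3) Reset play2; no force_cluster needed. This also makes
  #    rabbit@play see that rabbit@play2 has left the cluster.
  rabbitmqctl -n rabbit@play2 reset

  # ... repeat 2) and 3) for every node other than play ...

  # 4) Cluster each node back to the surviving disc node (listing the
  #    node's own name keeps it a disc node) and start the app
  rabbitmqctl -n rabbit@play2 cluster rabbit@play rabbit@play2
  rabbitmqctl -n rabbit@play2 start_app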

Francesco.

At Wed, 13 Jun 2012 10:54:31 +0100,
Francesco Mazzoli wrote:
> 
> Hi,
> 
> > As I understand from other messages on this forum, in a clustered
> > setup, the last node shut down should be the first node set up. Again
> > (in my possibly incorrect assumption), this is because Rabbit and/or
> > Mnesia may wait for what they believe to be the previous master to
> > come up first.
> 
> That's correct; this is because mnesia wants to make sure that the
> node with the most up-to-date dataset starts up first, so that we
> avoid diverging tables.
> 
> > Now, consider a situation like this, where there are N queues that are
> > mastered on different brokers (e.g. rabbit@play, rabbit@play2). If we
> > pulled the power cord on all these machines, what should the node
> > startup order be?
> 
> If you shut the nodes down abruptly, rabbit won't complain whatever
> order you start them in, because it won't know which nodes were
> running at the time of shutdown (in a clean shutdown those are
> recorded in a file as part of the shutdown sequence). In other words,
> it's up to you to restart them so that the node with the most
> up-to-date mnesia database is started first; if mnesia thinks that
> the local node is not the most up-to-date one, it will hang waiting
> for the table copies on the other nodes, which are offline.
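> 
> If it's not obvious which node that is, one rough heuristic (and only
> a heuristic) is to compare how recently the mnesia directory on each
> machine was last written to, since the node whose tables were written
> most recently is usually the best candidate. Assuming the default
> Debian/Ubuntu layout (the path may differ on your installation) and a
> node called rabbit@play, something like
> 
>   ls -lt /var/lib/rabbitmq/mnesia/rabbit@play | head
> 
> run on each machine shows which database files were touched most
> recently.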
> 
> > And at the risk of asking a broader question, what is the recommended
> > approach to restarting from a catastrophic power failure where all
> > nodes go down within a very short period of time?
> 
> I would say that the safest thing to do here is:
> 
>   1) Start one disc node. If it hangs waiting for the tables, try the
>      next one until one works. If none works, things are ugly; I can
>      think of ways of fixing them manually, but that's more
>      complicated (and dangerous).
>   2) Start another node without starting rabbit (you can do that by
>      setting the RABBITMQ_NODE_ONLY env variable).
>   3) Reset it, force_cluster it to the disc node you brought up,
>      and then reset it again. This will make the disc node believe that
>      the original node has left the cluster.
>   4) Once you have done this for each node, you will be left with only
>      one node which is not in a cluster, and you can cluster your nodes
>      back to that one.
> 
> This is pretty ugly but it's the only safe way in all situations, due
> to the possibility of the nodes performing upgrades. If you're sure
> that the nodes won't need to upgrade (e.g. the same versions of
> rabbit and erlang everywhere), you can perform step 1 and then just
> start the other nodes
> normally later, and it should be OK. Someone else in the team might
> have a better idea, but I don't :).
> 
> By the way, we're working hard on making this process, and clustering
> in general, simpler and safer, so things should be better in the
> future.
> 
> Francesco
> 
> At Tue, 12 Jun 2012 10:29:52 -0700,
> Matt Pietrek wrote:
> > 
> > Looking for some clarification here.
> > 
> > As I understand from other messages on this forum, in a clustered
> > setup, the last node shut down should be the first node set up. Again
> > (in my possibly incorrect assumption), this is because Rabbit and/or
> > Mnesia may wait for what they believe to be the previous master to
> > come up first. By starting up the "master" first, any blocking/waiting
> > can be avoided. In addition, message loss can be avoided by preventing
> > a prior out-of-sync slave from becoming the master.
> > 
> > Now, consider a situation like this, where there are N queues that are
> > mastered on different brokers (e.g. rabbit@play, rabbit@play2). If we
> > pulled the power cord on all these machines, what should the node
> > startup order be?
> > 
> > real_cm rabbit@play +2  HA D Active 0 0 0
> > aliveness-test rabbit@play  Active 0 0 0
> > carbon rabbit@play +2  HA D Idle 0 0 0
> > cmcmd rabbit@play +2  HA D Idle 0 0 0
> > fake_cm rabbit@play2 +2  HA D Idle 0 0 0
> > fake_mu_queue rabbit@play2 +2  HA D Idle 0 0 0
> > fake_service_2 rabbit@play +2  HA D Idle 0 0 0
> > random rabbit@play +2  HA D Idle
> > 
> > And at the risk of asking a broader question, what is the recommended
> > approach to restarting from a catastrophic power failure where all
> > nodes go down within a very short period of time?
> > 
> > In our experiments with RabbitMQ 2.8.2, Ubuntu 10.04 and Erlang R13B03,
> > it's a total crap shoot whether the cluster comes back up or hangs
> > with all nodes stuck at the "starting database...." point.
> > 
> > Thanks,
> > 
> > Matt

