[rabbitmq-discuss] Best practices for cluster upgrades with uninterrupted service

Wed Dec 18 10:28:11 GMT 2013

Hi Matt. I think your original plan is reasonable for feature upgrades.

For bugfix upgrades you should be able to cut out some of the steps. In
particular, you do not need to add and remove nodes from the cluster in
order to upgrade them; you can just take them down and upgrade them one
at a time while they are still registered in the cluster.
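
As a rough illustration of that one-at-a-time order, here is a minimal
Python sketch that only builds the list of steps and runs nothing. The
node names and the helper name rolling_upgrade_plan are made up for the
example, and the package upgrade and broker restart are left as a
placeholder because they depend on how RabbitMQ was installed.

```python
# Sketch only: builds the ordered steps for a rolling bugfix upgrade.
# Nothing is executed. Node names and rolling_upgrade_plan are hypothetical.

UPGRADE_PLACEHOLDER = "upgrade packages and restart the broker (site-specific)"


def rolling_upgrade_plan(nodes):
    """Return (node, action) steps that handle one node fully before the next.

    An action is either an argv list for rabbitmqctl or a plain-text
    placeholder for work that depends on the local installation.
    """
    plan = []
    for node in nodes:
        # Stop the broker on this node; it stays registered in the cluster.
        plan.append((node, ["rabbitmqctl", "-n", node, "stop"]))
        # Upgrading and restarting is installation-specific, so only noted.
        plan.append((node, UPGRADE_PLACEHOLDER))
        # After the restart, confirm the node is back in the cluster.
        plan.append((node, ["rabbitmqctl", "-n", node, "cluster_status"]))
    return plan


if __name__ == "__main__":
    for node, action in rolling_upgrade_plan(["rabbit@node1", "rabbit@node2"]):
        text = " ".join(action) if isinstance(action, list) else action
        print(node + ": " + text)
```

The point of the ordering is that the second node is not touched until
the first has been checked with cluster_status.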

Cheers, Simon

On 09/12/2013 22:42, Matt Pietrek wrote:
> Hey all RabbitMQ devs and experts,
>
> I’m looking for validation of an approach, or a better suggestion on
> how to accomplish it.
>
> The big picture is that we want to improve how we upgrade our clusters
> without interruptions in service. I know RabbitMQ supports running mixed
> versions within a cluster, but for us, upgrade may also mean bringing
> down a node for reconfiguration, an Erlang upgrade, or any other number
> of scenarios.
>
> Today we do controlled switchover of load from one RabbitMQ cluster to
> another, and then switch it back. We use two clusters (a master and an
> alternate). Both clusters are identically configured with two RabbitMQ
> instances, i.e. “master” is node1,node2, and “alternate” is node3,node4.
>
> All queues are mirrored. We direct traffic to the master or alternate
> via a VIP.
>
> Today our upgrade process looks like this:
>
> --------
>
> * master is operating normally, and pointed to by the VIP.
>
> * Shut down both alternate nodes, upgrade them in whatever manner is
> necessary. Newer version, more ram, whatever.
>
> * Start up alternate cluster, then create the same set of queues on it
> as the master has.
>
> * Redirect the VIP to point to the alternate.
>
> * Clients see a connection drop because of the VIP change, but are
> tolerant and auto-reconnect.
>
> * Copy all queue contents from the master’s queues to the alternate’s
> same-named queues.
>
> * Shut down the master nodes and upgrade them.
>
> * Perform similar steps as above to move the VIP and messages back to
> the master.
>
> --------
>
> While this generally works, there are small but important problems with
> it. One problem is that some of our queues are created/deleted. I have
> seen scenarios where unfortunate timing can cause a queue to not be on
> the right cluster at the right moment.
>
> Looking at newer RabbitMQ features, in particular policy based mirroring
> control, I’m thinking something like this would be better:
>
> -------
>
> * Have node1/node2 running along normally, with the VIP pointing at it.
>
> * Start up node3/node4 in the same cluster.
>
> * Use policy to make all queues mirrored by all four nodes. Use
> sync_queue as necessary to force synchronization in a reasonable time
> frame.
>
> * When all queues are synched, remove node1/node2 from the cluster and
> upgrade them.
>
> * Because our VIP is managed by keepalived, either node3 or node4 will
> obtain the VIP when node1/node2 are removed.
>
> --------
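
For reference, the steps above map onto rabbitmqctl roughly as in the
following Python sketch, which only assembles the command lines and runs
nothing. The node names, the queue names, the policy name "ha-all" and
the helper function names are assumptions made for the example; the
subcommands used are join_cluster, set_policy, sync_queue and
forget_cluster_node.

```python
# Sketch only: assembles rabbitmqctl argv lists for the proposed plan.
# Nothing is executed. Node, queue and policy names are hypothetical.
import json


def ctl(node, *args):
    """Build a rabbitmqctl command line that targets the given node."""
    return ["rabbitmqctl", "-n", node] + list(args)


def join_cluster_cmds(new_node, existing_node):
    """Commands to bring new_node into the cluster existing_node is in."""
    return [
        ctl(new_node, "stop_app"),
        ctl(new_node, "join_cluster", existing_node),
        ctl(new_node, "start_app"),
    ]


def mirror_all_policy_cmd(node, name="ha-all", pattern="^"):
    """Policy that mirrors every queue matching pattern to all nodes."""
    definition = json.dumps({"ha-mode": "all"})
    return ctl(node, "set_policy", name, pattern, definition)


def sync_cmds(node, queues):
    """One sync_queue command per queue that needs an explicit sync."""
    return [ctl(node, "sync_queue", queue) for queue in queues]


def remove_node_cmds(leaving_node, remaining_node):
    """Stop the app on the leaving node, then forget it from a remaining one."""
    return [
        ctl(leaving_node, "stop_app"),
        ctl(remaining_node, "forget_cluster_node", leaving_node),
    ]


if __name__ == "__main__":
    steps = []
    for new in ("rabbit@node3", "rabbit@node4"):
        steps += join_cluster_cmds(new, "rabbit@node1")
    steps.append(mirror_all_policy_cmd("rabbit@node1"))
    steps += sync_cmds("rabbit@node1", ["orders", "events"])
    for old in ("rabbit@node1", "rabbit@node2"):
        steps += remove_node_cmds(old, "rabbit@node3")
    for argv in steps:
        print(" ".join(argv))
```

Whether every queue is actually synchronised still has to be checked
before the old nodes are removed; the sketch does not do that.
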
>
> The advantage of this is that it eliminates queue location timing
> windows and we don’t have to manually copy messages around. The only
> downside I’m aware of is that it won’t work when upgrading major/minor
> versions. That is, it should be fine from 3.4.3 to 3.4.7, but not 3.4 to
> 3.5. In that case, we'd use our original upgrade logic.
>
> Thoughts? Suggestions? Better ideas?
>
> Thanks,
>
>
> Matt