[rabbitmq-discuss] Best practices for cluster upgrades with uninterrupted service

Mon Dec 9 22:42:18 GMT 2013

Hey all RabbitMQ devs and experts,

I’m looking for validation of an approach, or a better suggestion on how
to accomplish the same goal.

The big picture is that we want to improve how we upgrade our clusters
without interruptions in service. I know RabbitMQ supports running mixed
versions within a cluster, but for us, upgrade may also mean bringing down
a node for reconfiguration, an Erlang upgrade, or any other number of
scenarios.

Today we do controlled switchover of load from one RabbitMQ cluster to
another, and then switch it back. We use two clusters (a master and an
alternate). Both clusters are identically configured with two RabbitMQ
instances, i.e., “master” is node1,node2, and “alternate” is node3,node4.

All queues are mirrored. We direct traffic to the master or alternate via a
VIP.

Today our upgrade process looks like this:

--------

* master is operating normally, and pointed to by the VIP.

* Shut down both alternate nodes, upgrade them in whatever manner is
necessary. Newer version, more RAM, whatever.

* Start up alternate cluster, then create the same set of queues on it as
the master has.

* Redirect the VIP to point to the alternate.

* Clients see a connection drop because of the VIP change, but are tolerant
and auto-reconnect.

* Copy all queue contents from the master’s queues to the alternate’s
same-named queues.

* Shut down the master nodes and upgrade them.

* Perform similar steps as above to move the VIP and messages back to the
master.

--------

While this generally works, there are small but important problems with it.
One problem is that some of our queues are created/deleted. I have seen
scenarios where unfortunate timing can cause a queue to not be on the right
cluster at the right moment.
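
One way to catch that timing problem before moving the VIP would be to
compare the queue names on both clusters. A minimal sketch in Python,
assuming the name lists have already been fetched from each cluster (for
example with "rabbitmqctl list_queues name"); the function name and the
sample queue names are hypothetical:

```python
def queue_drift(master_queues, alternate_queues):
    """Return (missing_on_alternate, extra_on_alternate) as sorted lists."""
    master = set(master_queues)
    alternate = set(alternate_queues)
    return sorted(master - alternate), sorted(alternate - master)


# Hypothetical queue names, for illustration only.
missing, extra = queue_drift(
    ["orders", "billing", "tmp.42"],
    ["orders", "billing", "tmp.17"],
)
print("create on alternate:", missing)
print("stale on alternate:", extra)
```

This only narrows the window; a queue can still be created or deleted
between the comparison and the VIP change.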

Looking at newer RabbitMQ features, in particular policy based mirroring
control, I’m thinking something like this would be better:

--------

* Have node1/node2 running along normally, with the VIP pointing at it.

* Start up node3/node4 in the same cluster

* Use policy to make all queues mirrored by all four nodes. Use sync_queue
as necessary to force synchronization in a reasonable time frame.

* When all queues are synched, remove node1/node2 from the cluster and
upgrade them.

* Because our VIP is managed by keepalived, either node3 or node4 will
obtain the VIP when node1/node2 are removed.

--------
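
The policy and synchronization steps above could be scripted roughly as
follows. This is a sketch under assumptions, not a tested procedure: it
only builds the rabbitmqctl command lines (set_policy with "ha-mode": "all",
and sync_queue) rather than running them, and it assumes queue state has
been read from the management plugin's /api/queues JSON, where each
mirrored queue reports "node" and "synchronised_slave_nodes". The policy
name "ha-all", the helper names, and the sample data are hypothetical.

```python
import json


def mirror_all_policy_cmd(policy_name="ha-all", pattern="^"):
    """Build the rabbitmqctl command that mirrors matching queues to all nodes."""
    definition = json.dumps({"ha-mode": "all"})
    return ["rabbitmqctl", "set_policy", policy_name, pattern, definition]


def queues_blocking_removal(queues, leaving_nodes):
    """Return names of queues with no synchronised copy outside leaving_nodes.

    `queues` is a list of dicts shaped like the /api/queues response (assumed).
    """
    leaving = set(leaving_nodes)
    blocked = []
    for q in queues:
        # Nodes holding a full copy: the queue master plus synchronised mirrors.
        holders = {q["node"]} | set(q.get("synchronised_slave_nodes", []))
        if not (holders - leaving):
            blocked.append(q["name"])
    return blocked


def sync_cmds(queue_names):
    """Build one rabbitmqctl sync_queue command per queue name."""
    return [["rabbitmqctl", "sync_queue", name] for name in queue_names]


# Hypothetical snapshot: "billing" has no synchronised copy on node3/node4 yet.
snapshot = [
    {"name": "orders", "node": "rabbit@node1",
     "synchronised_slave_nodes": ["rabbit@node2", "rabbit@node3"]},
    {"name": "billing", "node": "rabbit@node2",
     "synchronised_slave_nodes": ["rabbit@node1"]},
]
blocked = queues_blocking_removal(snapshot, ["rabbit@node1", "rabbit@node2"])
print(mirror_all_policy_cmd())
print(sync_cmds(blocked))
```

The idea is to hold off on removing node1/node2 until
queues_blocking_removal returns an empty list.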

The advantage of this is that it eliminates queue location timing windows
and we don’t have to manually copy messages around. The only downside I’m
aware of is that it won’t work when upgrading major/minor versions. That
is, it should be fine from 3.4.3 to 3.4.7, but not 3.4 to 3.5. In that
case, we'd use our original upgrade logic.
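
The choice between the two procedures could be made mechanically from the
version strings. A minimal sketch, assuming plain dotted numeric versions;
the function names and the returned labels are hypothetical:

```python
def same_release_series(current, target):
    """True when two dotted version strings share major and minor numbers."""
    cur = [int(part) for part in current.split(".")]
    tgt = [int(part) for part in target.split(".")]
    return cur[:2] == tgt[:2]


def upgrade_procedure(current, target):
    """Pick the in-cluster approach for patch upgrades, else the switchover."""
    if same_release_series(current, target):
        return "in-cluster"
    return "two-cluster switchover"


print(upgrade_procedure("3.4.3", "3.4.7"))
print(upgrade_procedure("3.4", "3.5"))
```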

Thoughts? Suggestions? Better ideas?

Thanks,

Matt