[rabbitmq-discuss] Best practices for cluster upgrades with uninterrupted service

Mon Dec 16 17:37:59 GMT 2013

Hey all (and in particular the RabbitMQ devs).

Just wanted to call attention to my original question. I didn't get any
responses, and given the excellent level of help I've gotten in the past, I
figure my question was just overlooked in the general deluge of messages.

Thanks much,

Matt

On Mon, Dec 9, 2013 at 2:42 PM, Matt Pietrek <mpietrek at skytap.com> wrote:

> Hey all RabbitMQ devs and experts,
>
> I’m looking for validation of an approach, or a better suggestion for how
> to accomplish the same goal.
>
> The big picture is that we want to improve how we upgrade our clusters
> without interruptions in service. I know RabbitMQ supports running mixed
> versions within a cluster, but for us, upgrade may also mean bringing down
> a node for reconfiguration, an Erlang upgrade, or any other number of
> scenarios.
>
> Today we do controlled switchover of load from one RabbitMQ cluster to
> another, and then switch it back. We use two clusters (a master and an
> alternate). Both clusters are identically configured with two RabbitMQ
> instances, i.e., “master” is node1 and node2, and “alternate” is node3 and node4.
>
> All queues are mirrored. We direct traffic to the master or alternate via
> a VIP.
>
> Today our upgrade process looks like this:
>
> --------
>
> * Master is operating normally, and pointed to by the VIP.
>
> * Shut down both alternate nodes and upgrade them in whatever manner is
> necessary (newer version, more RAM, and so on).
>
> * Start up alternate cluster, then create the same set of queues on it as
> the master has.
>
> * Redirect the VIP to point to the alternate.
>
> * Clients see a connection drop because of the VIP change, but are
> tolerant and auto-reconnect.
>
> * Copy all queue contents from the master’s queues to the alternate’s
> same-named queues.
>
> * Shut down the master nodes and upgrade them.
>
> * Perform similar steps as above to move the VIP and messages back to the
> master.
>
> --------
>
> While this generally works, there are small but important problems with
> it. One problem is that some of our queues are created and deleted while
> the system is running. I have seen scenarios where unfortunate timing
> causes a queue to not be on the right cluster at the right moment.
>
> Looking at newer RabbitMQ features, in particular policy-based mirroring
> control, I’m thinking something like this would be better:
>
> -------
>
> * Have node1/node2 running along normally, with the VIP pointing at them.
>
> * Start up node3/node4 in the same cluster.
>
> * Use policy to make all queues mirrored by all four nodes. Use
> sync_queue as necessary to force synchronization in a reasonable time
> frame.
>
> * When all queues are synced, remove node1/node2 from the cluster and
> upgrade them.
>
> * Because our VIP is managed by keepalived, either node3 or node4 will
> obtain the VIP when node1/node2 are removed.
>
> --------
>
> The advantage of this is that it eliminates queue location timing windows
> and we don’t have to manually copy messages around. The only downside I’m
> aware of is that it won’t work when upgrading major/minor versions. That
> is, it should be fine from 3.4.3 to 3.4.7, but not 3.4 to 3.5. In that
> case, we'd use our original upgrade logic.
>
> Thoughts? Suggestions? Better ideas?
>
> Thanks,
>
>
> Matt
>
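
For anyone following along, here is a rough sketch of the policy and sync
steps from the second procedure quoted above. It is only an illustration
under stated assumptions, not a tested upgrade tool. The policy name
"ha-all", the queue names, and the node names are hypothetical. The helper
queues_at_risk expects a mapping you would have to fill in yourself (for
example from the output of "rabbitmqctl list_queues name
synchronised_slave_pids" plus the node hosting each queue's master); the
parsing of that output is deliberately left out.

```python
import json
import shlex


def build_mirror_plan(policy_name, queue_names, pattern="^"):
    """Return rabbitmqctl invocations (as argv lists) that apply an
    'ha-mode: all' policy to queues matching `pattern`, then request an
    explicit sync for each named queue."""
    definition = json.dumps({"ha-mode": "all"})
    plan = [["rabbitmqctl", "set_policy", policy_name, pattern, definition]]
    for name in queue_names:
        plan.append(["rabbitmqctl", "sync_queue", name])
    return plan


def queues_at_risk(synced_copies, surviving_nodes):
    """Return the sorted names of queues that have no synchronised copy
    on any node that will stay in the cluster.

    synced_copies: dict mapping queue name -> set of node names holding
    the master or a synchronised mirror (hypothetical structure, built
    by the caller)."""
    return sorted(
        queue
        for queue, nodes in synced_copies.items()
        if not (set(nodes) & set(surviving_nodes))
    )


if __name__ == "__main__":
    # Hypothetical queue names, used only to show the generated commands.
    for argv in build_mirror_plan("ha-all", ["orders", "events"]):
        print(" ".join(shlex.quote(arg) for arg in argv))

    # Hypothetical status snapshot: "events" is not yet synced to node3/node4.
    status = {
        "orders": {"node1", "node2", "node3", "node4"},
        "events": {"node1", "node2"},
    }
    print(queues_at_risk(status, {"node3", "node4"}))
```

The idea is to hold off on removing node1/node2 until queues_at_risk
returns an empty list, since any queue it names would lose messages if
the old nodes went away at that moment.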