[rabbitmq-discuss] Pulling RabbitMQ out of service

Fri Feb 4 15:19:11 GMT 2011

On Fri, Feb 04, 2011 at 07:00:09AM -0800, Bill Moseley wrote:
> One approach I was considering is having all clients (producers and
> consumers) connect to a load balancer in front of two independent RabbitMQ
> brokers.  They would not be in a cluster (although each one could be its own
> cluster of machines for scaling).   The balancer only uses one RabbitMQ
> broker and the second is hot-standby.  The consumers connect to a different
> IP than the producers.   Clients know to reconnect upon loss of a
> connection.
> 
> Then the trick is to use the load balancer to make the producers move to the
> new broker, move some consumers as well, and leave some other consumers to
> drain the queues on the old broker before pulling out of service.

Yup, that approach should work well.

> Renaming the queues is only needed if pulling a machine out of cluster (for
> the queues that were created on that machine), correct?  I would not need
> that with two separate brokers as I describe above, if I follow what you are
> saying.

Correct.

> > The "or-else" routing semantics of RabbitMQ's "Alternate Exchanges" may
> > well be of use here.
> > http://www.rabbitmq.com/extensions.html#alternate-exchange
> 
> Yes, I've been looking at those.  Are you saying that if the queue is not
> durable, then once all the consumers of that queue go away we could use the
> alternate-exchange as a type of fail-over?

The unfortunate thing about an alternate exchange (ae) is that it can't
be set dynamically on an exchange: it is fixed when the exchange is
declared.

So in a cluster, if you start with an exchange X with an ae of Y, then
you could have all your clients create exclusive queues and bind them to
X.

At the point you want to move over, you create exchange Y, with ae of Z,
add all the new clients (leave the old running) and the new clients
create new exclusive queues and bind to Y. At this point, the new
clients will not be receiving any messages at all.

Then gracefully shut down your old clients. They should explicitly
remove their bindings to X and then make sure their queues are empty
before disconnecting. As soon as the last binding is removed from X,
the ae kicks in atomically and messages are routed to Y, which then
delivers them to the new clients. Provided you do the binding
deletion and queue drain carefully, you should be able to guarantee no
message loss.
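
To make the ordering concrete, here is a toy in-memory model of the
steps above. It is only a sketch of the "or-else" routing rule, not the
RabbitMQ API: the Broker and Exchange classes, the fanout-style routing
and the plain lists standing in for queues are all assumptions made for
illustration.

```python
class Exchange:
    """Toy fanout-style exchange with an optional alternate exchange."""

    def __init__(self, name, ae=None):
        self.name = name
        self.ae = ae        # name of the alternate exchange, fixed at creation
        self.queues = []    # bound queues (plain lists stand in for queues)


class Broker:
    """Hypothetical stand-in for a broker; not a real client library."""

    def __init__(self):
        self.exchanges = {}
        self.dropped = []   # messages that could not be routed anywhere

    def declare(self, name, ae=None):
        self.exchanges[name] = Exchange(name, ae)

    def bind(self, name, queue):
        self.exchanges[name].queues.append(queue)

    def unbind(self, name, queue):
        ex = self.exchanges[name]
        # Remove by identity so two queues with equal contents stay distinct.
        ex.queues = [q for q in ex.queues if q is not queue]

    def publish(self, name, msg):
        ex = self.exchanges.get(name)
        if ex is None:
            self.dropped.append(msg)
        elif ex.queues:
            for q in ex.queues:
                q.append(msg)
        elif ex.ae is not None:
            # "Or-else": nothing is bound here, so fall through to the ae.
            self.publish(ex.ae, msg)
        else:
            self.dropped.append(msg)


def migrate_demo():
    broker = Broker()
    broker.declare("X", ae="Y")     # Y need not exist yet in this model
    old_q = []
    broker.bind("X", old_q)
    broker.publish("X", "m1")       # old clients receive it

    broker.declare("Y", ae="Z")     # new generation, already chained to Z
    new_q = []
    broker.bind("Y", new_q)
    broker.publish("X", "m2")       # still old clients only

    broker.unbind("X", old_q)       # old clients leave; ae takes over
    broker.publish("X", "m3")       # now reaches the new clients via Y
    return old_q, new_q, broker.dropped
```

In this model the new queue stays empty until the old binding on X is
removed, which is the behaviour described above; the old clients still
have to drain old_q before they disconnect.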

Then on the next upgrade step, you're just repeating the process but
with Z for Y. The problem though is that you'll end up with an
ever-growing chain of exchanges X -> Y -> Z -> ...

If you rethink it all, you'll see that you can ensure there are
always just 2 hops by using exchange-to-exchange bindings. I.e. you
start with A -> X -> [queues]. Then you add Y -> [new queues], then you
add A -> Y and remove A -> X. The only issue here is that those last two
steps cannot be done atomically, so there'll be a window where either
both new and old queues get the same message (if you add A -> Y first)
or messages are dropped (if you remove A -> X first). But effectively A
is abstracting over which version of the underlying routing topology
you're using. It's really quite powerful, though note that
exchange-to-exchange bindings are also a RabbitMQ-only feature at the
moment.

http://www.rabbitmq.com/extensions.html#exchange-bindings
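
The non-atomic window can be sketched with another toy model. Again
this is not the RabbitMQ API: the Broker class, the switch() helper and
the convention that a string destination means "another exchange" while
a list means "a queue" are assumptions made purely for illustration.

```python
class Broker:
    """Hypothetical in-memory router with exchange-to-exchange bindings."""

    def __init__(self):
        self.bindings = {}   # exchange name -> list of destinations
        self.dropped = []    # messages that reached no queue at all

    def bind(self, source, dest):
        self.bindings.setdefault(source, []).append(dest)

    def unbind(self, source, dest):
        # Only used for exchange names (strings) in this sketch.
        self.bindings[source].remove(dest)

    def publish(self, exchange, msg):
        if not self._route(exchange, msg):
            self.dropped.append(msg)

    def _route(self, exchange, msg):
        delivered = False
        for dest in self.bindings.get(exchange, []):
            if isinstance(dest, str):
                # Destination is another exchange: follow the binding.
                delivered = self._route(dest, msg) or delivered
            else:
                # Destination is a queue (a plain list in this model).
                dest.append(msg)
                delivered = True
        return delivered


def switch(add_first):
    """Publish one message inside the window between the two steps."""
    broker = Broker()
    old_q, new_q = [], []
    broker.bind("A", "X")      # A -> X -> [old queues]
    broker.bind("X", old_q)
    broker.bind("Y", new_q)    # Y -> [new queues], not yet reachable from A

    if add_first:
        broker.bind("A", "Y")
        broker.publish("A", "m")    # both topologies are live
        broker.unbind("A", "X")
    else:
        broker.unbind("A", "X")
        broker.publish("A", "m")    # neither topology is live
        broker.bind("A", "Y")
    return old_q, new_q, broker.dropped
```

Here switch(True) leaves "m" in both the old and the new queue, while
switch(False) leaves both queues empty and "m" in the dropped list. If
your consumers are idempotent, adding A -> Y first is probably the
safer ordering.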

> The story from our IT department is we don't like DRBD and NAS/SAN is too
> expensive. ;)  But, they want to be able to yank the plug on a RabbitMQ box
> and have the application continue with no disruption.  Basically, no message
> is ever lost.

Then tell them where to sign ;)

> Yes, the key might be knowing what failures we can withstand.  My current
> thinking has shifted a bit.  I think that instead of trying to build
> something that never fails, just assume it's not very likely to fail but
> design the application to handle a queue failure.  What that means is that
> for messages (really "jobs" in this case) that must complete, the producer,
> or some agent of the producer (perhaps a cron job) will see the job was not
> completed in a timely way and send a new message.

Yup. Idempotency of operations here will make your life easier, as will
things like publisher confirms and/or transactions.
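
As a rough sketch of what that can look like on the application side:
the consumer skips job ids it has already completed, and the producer's
agent (the cron job, say) works out which jobs are overdue and should
be sent again. The function names, the job ids and the in-memory set
and dict are all hypothetical; in practice the "completed" record would
need to live somewhere durable.

```python
def handle_delivery(job_id, payload, completed, handler):
    """Run handler at most once per job_id, so a re-sent job is harmless.

    Returns True if the job was processed, False if it was a duplicate.
    """
    if job_id in completed:
        return False
    handler(payload)
    completed.add(job_id)
    return True


def jobs_to_resend(sent_at, completed, now, timeout):
    """Return ids of jobs sent more than `timeout` ago and still not done.

    sent_at maps job id -> time the job message was published.
    """
    return sorted(
        job_id
        for job_id, sent in sent_at.items()
        if job_id not in completed and now - sent > timeout
    )
```

For example, with sent_at = {"job-1": 0, "job-2": 0, "job-3": 50},
job-1 already completed, now=60 and timeout=30, only job-2 would be
re-sent: job-1 is done and job-3 is not yet overdue.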

> Maybe that's not so pretty if the queue has a million messages pending when
> it fails, but if that's the case then there's other more serious
> problems....

Exactly.

Matthew