[rabbitmq-discuss] Pulling RabbitMQ out of service

Fri Feb 4 15:00:09 GMT 2011

On Fri, Feb 4, 2011 at 5:10 AM, Matthew Sackman <matthew at rabbitmq.com> wrote:

> On Mon, Jan 31, 2011 at 07:26:34AM -0800, Bill Moseley wrote:
> > For those of you running multiple RabbitMQ servers in a cluster, what is
> > your procedure when you want to shut one of the servers down (e.g. for
> > maintenance) but not disrupt overall service?   Queues only live on one
> > server so I'm wondering how (or if) you do something to flush out the
> > queue before stopping the machine.
>
> Usual best practise is to force clients to reconnect elsewhere,
> recreating the resources they need. This may need some careful thought
> with ordering of events etc. Frequent best practise is that publishers
> create exchanges, and consumers create the queues they need and bind
> them as necessary. To avoid missing any messages you'll need to start up
> new consumers before taking down the old ones.

Thanks for responding -- I had been thinking about this question again just
this morning, and here was your response!

Configuration on the clients is something I'm trying to reduce -- as well as
the need to trigger a reconnect on all clients manually.  Maybe what I need
is better configuration management.

One approach I was considering is having all clients (producers and
consumers) connect to a load balancer in front of two independent RabbitMQ
brokers.  They would not be in a cluster (although each one could be its own
cluster of machines for scaling).  The balancer sends traffic to only one
broker at a time; the second is a hot standby.  The consumers connect to a
different IP than the producers, so the two groups can be moved
independently.  Clients know to reconnect upon loss of a connection.
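
A minimal sketch of the "clients know to reconnect" part, assuming the client
library raises an OSError subclass (ConnectionError, socket errors) when the
balancer address is unreachable.  The connect callable, the attempt limit and
the delays are placeholders for illustration, not part of any RabbitMQ client
API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure.

    `connect` is a hypothetical callable that opens a connection to the
    load-balancer address and returns it; swap in the real client call.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

A client would call this again whenever it notices its connection has
dropped, which is what lets the balancer move it to the other broker.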

Then the trick is to use the load balancer to make the producers move to the
new broker, move some consumers as well, and leave some other consumers to
drain the queues on the old broker before pulling it out of service.
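
The "drain before pulling" step could be gated by something like the sketch
below.  How the depth is obtained is left to the caller: queue_depth is a
hypothetical callable (AMQP's queue.declare-ok reply carries a message count,
which is one possible source), and the timeout and poll interval are
arbitrary:

```python
import time
from typing import Callable


def wait_until_drained(
    queue_depth: Callable[[], int],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll queue_depth() until it reports 0 or the timeout expires.

    Returns True if the queue was seen empty, False if we gave up first.
    """
    deadline = clock() + timeout
    while True:
        if queue_depth() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Only when this returns True for every queue on the old broker would the
remaining consumers be moved and the machine taken down.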

> But they must create new
> queues, not on the to-die node. So this requires the queue names must be
> fresh, but then you're going to have to deal with the possibilities of
> duplicate messages during the period that multiple sets of consumers are
> up etc.
>

Renaming the queues is only needed if pulling a machine out of a cluster (for
the queues that were created on that machine), correct?  I would not need
that with two separate brokers as I describe above, if I follow what you are
saying.

> The "or-else" routing semantics of RabbitMQ's "Alternate Exchanges" may
> well be of use here.
> http://www.rabbitmq.com/extensions.html#alternate-exchange

Yes, I've been looking at those.  Are you saying that if the queue is not
durable, then once all the consumers of that queue go away we could use the
alternate-exchange as a type of fail-over?
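
As a toy model of the "or-else" semantics being asked about (plain Python, no
broker involved): a message that matches no binding on the primary exchange
is handed to its alternate.  The class, exchange and queue names are made up,
and the model assumes that a deleted queue loses its bindings; it says
nothing about durability or about when a queue actually gets deleted:

```python
from typing import Dict, List, Optional, Set


class ToyExchange:
    """Direct-style exchange: routing key -> names of bound queues."""

    def __init__(self, name: str, alternate: Optional["ToyExchange"] = None):
        self.name = name
        self.alternate = alternate  # used only when nothing here matches
        self.bindings: Dict[str, Set[str]] = {}

    def bind(self, queue: str, routing_key: str) -> None:
        self.bindings.setdefault(routing_key, set()).add(queue)

    def remove_queue(self, queue: str) -> None:
        # Model of queue deletion: all of its bindings disappear.
        for queues in self.bindings.values():
            queues.discard(queue)

    def route(self, routing_key: str) -> List[str]:
        queues = self.bindings.get(routing_key, set())
        if queues:
            return sorted(queues)
        if self.alternate is not None:
            # "Or else": unroutable here, so try the alternate exchange.
            return self.alternate.route(routing_key)
        return []  # unroutable everywhere
```

In this model the fail-over only kicks in once the primary queue and its
bindings are gone; a queue that still exists with no consumers keeps
receiving the messages.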

> > Now, this is a bit tougher: How about catastrophic failures?  I'm
> > wondering about using the complexity of Pacemaker and DRBD vs. tracking
> > incomplete jobs and resubmitting after some time.
>
> Horses for courses really. We know of a number of clients who are using
> the pacemaker stuff, though frequently with NAS/SAN rather than DRBD. If
> you can work out what failures you can withstand and what you can't and
> then pick the best approach to match.
>

The story from our IT department is we don't like DRBD and NAS/SAN is too
expensive. ;)  But, they want to be able to yank the plug on a RabbitMQ box
and have the application continue with no disruption.  Basically, they want
no message ever to be lost.

Yes, the key might be knowing what failures we can withstand.  My current
thinking has shifted a bit.  Instead of trying to build something that never
fails, I think we should assume it is unlikely to fail but design the
application to handle a queue failure anyway.  That means that for messages
(really "jobs" in this case) that must complete, the producer, or some agent
of the producer (perhaps a cron job), will see that the job was not completed
in a timely way and send a new message.
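
A rough sketch of that idea, assuming each job has an id the producer can
track.  JobTracker, publish and the timeout are invented for illustration; a
real version would keep the pending set somewhere durable (a database table,
say) so that the cron job can read it:

```python
import time
from typing import Callable, Dict, List


class JobTracker:
    """Remember when each job was last sent; resend the ones that stall."""

    def __init__(
        self,
        publish: Callable[[str], None],
        timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._publish = publish  # hypothetical: sends the job message
        self._timeout = timeout  # seconds before a job counts as overdue
        self._clock = clock
        self._pending: Dict[str, float] = {}  # job id -> time last sent

    def submit(self, job_id: str) -> None:
        self._publish(job_id)
        self._pending[job_id] = self._clock()

    def complete(self, job_id: str) -> None:
        # Called when a worker reports the job as done.
        self._pending.pop(job_id, None)

    def resubmit_overdue(self) -> List[str]:
        """Resend every pending job older than the timeout; return their ids."""
        now = self._clock()
        overdue = [
            job_id
            for job_id, sent in self._pending.items()
            if now - sent >= self._timeout
        ]
        for job_id in overdue:
            self.submit(job_id)
        return overdue
```

Since a slow job and a lost job look the same from here, the workers would
have to tolerate the occasional duplicate.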

Maybe that's not so pretty if the queue has a million messages pending when
it fails, but if that's the case then there are other, more serious
problems.

-- 
Bill Moseley
moseley at hank.org