[rabbitmq-discuss] Application architecture question: queue failure
Tim Watson
tim at rabbitmq.com
Wed Jun 13 09:53:07 BST 2012
On 06/06/2012 14:05, Bill Moseley wrote:
> We try and design our architecture (for a large web application) in a
> way that we expect any part to fail: "design for failure". There's
> parts of our system that are prime for RabbitMQ but the concern is that
> a message must never be lost -- even when RabbitMQ is setup for HA.
>
Hi Bill. If you do not want to lose messages, RabbitMQ has
support for ensuring that messages are on-disk and that the producer
knows this for sure. When rabbit is set up this way, no message will be
lost and we would consider it a serious bug if any were (and would fix
the problem as soon as humanly possible!) - so I think having the right
setup for this should probably be your first port of call.
The first step is to declare your queue as durable, ensuring it will
survive a broker crash. Secondly, you should indicate that your messages
are persistent, instructing the broker to put them on disk so they too
will survive any crash.
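As a rough sketch of those two settings (not taken from any particular
application), the snippet below declares a durable queue and publishes a
persistent message to it. The queue name "jobs", the helper
declare_and_publish and its properties_factory parameter are assumptions
made for illustration; the channel is expected to accept pika-style
keyword arguments.

```python
# Sketch only: the queue name and helper names are illustrative assumptions.
QUEUE_NAME = "jobs"

# delivery_mode=2 marks a message as persistent in AMQP 0-9-1.
PERSISTENT = 2


def declare_and_publish(channel, properties_factory, body):
    """Declare a durable queue and publish one persistent message to it.

    `channel` is any AMQP 0-9-1 channel object exposing queue_declare and
    basic_publish with pika-style keyword arguments.
    """
    # durable=True: the queue definition survives a broker restart.
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    # Publish via the default exchange, which routes by queue name.
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE_NAME,
        body=body,
        properties=properties_factory(delivery_mode=PERSISTENT),
    )
```

With the Python client pika, for example, you would pass
pika.BasicProperties as properties_factory and a channel opened on a
BlockingConnection. Both settings are needed: a persistent message in a
non-durable queue is still gone once the queue itself disappears.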
The next step is to set up producer confirms on the channel you're
publishing to. The broker confirms it has taken responsibility for these
messages by sending a basic.ack on the same channel. Persistent messages
are confirmed when all queues have either delivered the message and
received an acknowledgement, or persisted the message (which includes
flushing the data to disk). It is equally important that your consumers
acknowledge messages explicitly rather than relying on automatic
acknowledgement, so that nothing is lost in transit between the broker
and the consumer.
As per the documentation on publisher confirms (see
http://www.rabbitmq.com/extensions.html#confirms), the broker will send
a basic.nack if it cannot, for some reason, handle the message. The
publisher can then decide what to do.
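One way a publisher might keep track of which messages the broker has
taken responsibility for is sketched below. It assumes confirm mode has
been enabled on the channel, that publishes are counted from 1 per
channel, and that each basic.ack or basic.nack carries a delivery tag
plus a "multiple" flag covering all tags up to that one. The class and
method names (ConfirmTracker, published, on_ack, on_nack, unconfirmed)
are hypothetical and not part of any client library.

```python
class ConfirmTracker:
    """Track published messages until the broker acks or nacks them."""

    def __init__(self):
        self._next_seq = 1   # confirms are numbered from 1 on each channel
        self._pending = {}   # sequence number -> message awaiting a confirm

    def published(self, message):
        """Record a message just published; return its sequence number."""
        seq = self._next_seq
        self._pending[seq] = message
        self._next_seq += 1
        return seq

    def _take(self, delivery_tag, multiple):
        # multiple=True covers every outstanding tag up to and including
        # delivery_tag; otherwise only that single tag is affected.
        if multiple:
            tags = sorted(t for t in self._pending if t <= delivery_tag)
        else:
            tags = [delivery_tag] if delivery_tag in self._pending else []
        return [self._pending.pop(t) for t in tags]

    def on_ack(self, delivery_tag, multiple=False):
        """Broker took responsibility: these messages can be forgotten."""
        return self._take(delivery_tag, multiple)

    def on_nack(self, delivery_tag, multiple=False):
        """Broker could not handle these: the caller decides what to do."""
        return self._take(delivery_tag, multiple)

    def unconfirmed(self):
        """Messages still in doubt, e.g. after the connection drops."""
        return [self._pending[t] for t in sorted(self._pending)]
```

Messages returned by on_nack, or still listed by unconfirmed() when the
connection is lost, are the ones a publisher may want to publish again.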
When you then set up your cluster, you can use HA (active/active mirror)
queues to make sure that each node in the cluster behaves in this way
with regard to handling messages. If the master node crashes, any
message that has been ack'ed is going to be safely persisted on a mirror
queue on one of the neighboring nodes.
> So, designing for that rare situation that a message might get lost, my
> approach has been to maintain state on the application. When I send a
> message to get some work done I flag it as "in process" or "pending"
> with a start time and a retry counter. I can then (say with cron) find
> the uncompleted tasks that have been waiting for some value of too long.
>
> But, then the problem is what to do with that information? How do I
> know that the message is really lost and not just backed up in the
> queue? Don't want to queue it again in this case as it just compounds
> the problem (and then if the first job finally completes the state of
> the wrong message is updated).
>
To my mind, this comes down entirely to how you process messages in the
consumer. If a persistent message is written to a durable queue, the
channel is set up to use publisher confirms, and the message has been
ack'ed by the broker it is *definitely* not lost: it is on disk at this
point.
In AMQP-0.9.1 the issue of whether or not a job has been processed (by a
consumer) already is entirely dependent on the client. You can set up a
system for tracking jobs by unique id (uuid, or whatever) in the
consumer, to ensure that a job isn't processed twice. You should
probably read
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2010-August/008271.html
to get a feel for why this requires explicit effort by the consumers.
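A minimal sketch of that kind of tracking follows, assuming each job
carries a unique id in the message's message_id property and that the
consumer callback has the (channel, method, properties, body) shape used
by the Python client pika. The in-memory set stands in for whatever
durable store (a database table, say) a real system would need, and
make_handler and process_job are hypothetical names.

```python
def make_handler(process_job, seen_ids=None):
    """Build a consumer callback that processes each job id at most once."""
    seen = set() if seen_ids is None else seen_ids

    def on_message(channel, method, properties, body):
        job_id = properties.message_id
        if job_id not in seen:
            process_job(job_id, body)
            # Record the id only after the work succeeded, so that a job
            # which failed part-way is not mistaken for a completed one.
            seen.add(job_id)
        # Ack first deliveries and duplicates alike. If process_job raised,
        # this line is never reached and the message stays unacknowledged.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```

The point of the design is that the message is acknowledged only once
the job id has been recorded, so a redelivered copy of the same job is
acked and dropped instead of being processed a second time.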
> That, and I frankly think the overhead of my state tracking is possibly
> more problematic than the potential for a loss of a message.
>
Well, I'd like to know where your concerns about loss of messages come
from. As with anything built by humans, there is always *some* risk of
things going wrong with any software, but rabbit does a *lot* of work to
make sure this doesn't ever happen. If you're using publisher confirms
and clustering (the latter choosing the Consistency and
Availability parts of the CAP theorem), then you should not have to
worry about this too much, especially in a 'work-queue' scenario, which
this sounds like.
> Anyway, sorry if this is a mundane (if not a bit off-topic) question --,
> and I know it's application-specific. But, it's a question that comes up
> often in our design discussions.
>
> Do you have these concerns and how do you handle the possibility of
> message or queue loss?
>
Well, rabbit fights like a cornered... rabbit ... to make sure this
doesn't happen! ;)
Please feel free to elaborate on your concerns and questions, as that's
what the list is for! I'd certainly like to understand a bit more about
how your application works, what constitutes a job and how these are
identified throughout the system. I often find the issue of identity can
be particularly vexing in any non-trivial architecture.