[rabbitmq-discuss] Examining Queue Failover Behavior

Thu Feb 12 11:32:25 GMT 2009

Jason,

On Wed, Feb 11, 2009 at 7:57 PM, Jason J. W. Williams
<jasonjwwilliams at gmail.com> wrote:
> Well, my only argument would be that a persistent queue doesn't
> guarantee a whole lot if the message never makes it there.

True. But this is fundamentally a tricky problem to solve without
tightly coupling producers and consumers. In general I think that it
is better for the messaging infrastructure to provide the application
with the primitives it needs to get the desired level of service,
because the application knows more about itself than the broker
does (what this could mean is exemplified below). Basically,
what I mean is that exploiting application-specific knowledge can help
prevent turning a messaging broker into a database.

> The problem,
> it seems to me, depends on the application. If the producer has
> pre-knowledge that the message it's entrusting to the MQ is important,
> it needs to be able to specify a level of persistence in case the MQ
> crashes before the exchange can route it.

True, and it ties in with the above. A blunt yet effective solution would
be to set up an archiving consumer that keeps a copy of each
published message in a separate store; some people are using
CouchDB for this kind of thing. This pushes the reliability guarantee
up to a layer that can decide what is important and what is not, thus
keeping the core lean and mean.
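A rough sketch of that archiving pattern follows, using a toy in-memory model in Python rather than a real broker or AMQP client. The names `FanoutExchange`, `ArchivingConsumer` and the `keep` predicate are hypothetical, and a plain dict stands in for an external store such as CouchDB. The only point it tries to show is that the archive gets its own copy of each message through its own queue, and that the application, not the broker, decides what is worth keeping.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class FanoutExchange:
    """Hypothetical in-memory stand-in for a fanout exchange.

    Every published message is copied to every bound queue; with no
    bound queues the message is simply dropped.
    """

    def __init__(self) -> None:
        self.queues: List[Deque[str]] = []

    def bind(self, queue: Deque[str]) -> None:
        self.queues.append(queue)

    def publish(self, message: str) -> int:
        for queue in self.queues:
            queue.append(message)
        return len(self.queues)  # number of queues that received a copy


class ArchivingConsumer:
    """Hypothetical archiving consumer with its own bound queue.

    A dict stands in for an external store (e.g. CouchDB); the `keep`
    predicate is where application-specific knowledge decides what
    is important enough to archive.
    """

    def __init__(self, exchange: FanoutExchange,
                 keep: Callable[[str], bool] = lambda m: True) -> None:
        self.queue: Deque[str] = deque()
        self.keep = keep
        self.store: Dict[int, str] = {}
        exchange.bind(self.queue)

    def drain(self) -> None:
        # Move everything waiting in the archive's queue into the store.
        while self.queue:
            message = self.queue.popleft()
            if self.keep(message):
                self.store[len(self.store)] = message


exchange = FanoutExchange()
app_queue: Deque[str] = deque()
exchange.bind(app_queue)
archiver = ArchivingConsumer(exchange, keep=lambda m: m.startswith("order"))

for body in ["order-1", "heartbeat", "order-2"]:
    exchange.publish(body)

app_queue.clear()  # the application consumes (and discards) its copies
archiver.drain()
print(list(archiver.store.values()))  # ['order-1', 'order-2']
```

Because the archive has its own queue, it still holds the orders after the application has consumed its copies; the made-up "order" prefix check is just a placeholder for whatever rule the application actually uses.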

> Beyond that, I think it's up
> to the consumer to make sure there's a queue available for the message.
> But in a failure environment, since exchanges fail over but queues do
> not, you now have an issue where producers may publish messages before
> consumers can re-attach and recreate the queues. Whereas initially,
> neither the exchanges nor the queues would exist until the consumers
> created them, thereby preventing the producers from publishing into the
> ether. As I write this out, it now strikes me that that is the crux of
> our issue: exchange metadata fails over automatically, but queues do not.

This is a fair point and ties in with my previous comments about

a) it being tricky to reach a distributed consensus;
b) the questionable semantics of queue deletes for subscriptions,
according to the spec;
c) using AMQP events to allow an application to react to changes in the system.
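To make the failover gap in the quoted paragraph concrete, here is a toy in-memory model in Python. It is not how Rabbit implements clustering or failover, and `Broker`, `fail_over` and `GuardedProducer` are hypothetical names. It assumes that exchange declarations survive a failover while queues, their bindings and their contents do not. The producer-side guard holds back any message that reached zero queues and retries it once a consumer has redeclared its queue; this loosely resembles publishing with AMQP's `mandatory` flag set, under which the broker hands an unroutable message back to the publisher with `basic.return`.

```python
from collections import deque
from typing import Deque, Dict, List


class Broker:
    """Hypothetical toy broker: exchange declarations survive fail_over(),
    but queues, their bindings and their contents do not."""

    def __init__(self) -> None:
        self.exchanges: Dict[str, List[str]] = {}  # exchange -> bound queue names
        self.queues: Dict[str, Deque[str]] = {}

    def declare_exchange(self, name: str) -> None:
        self.exchanges.setdefault(name, [])

    def declare_queue(self, name: str, exchange: str) -> None:
        # In this model, declaring a queue also binds it to the exchange.
        self.queues.setdefault(name, deque())
        if name not in self.exchanges[exchange]:
            self.exchanges[exchange].append(name)

    def publish(self, exchange: str, message: str) -> int:
        """Route to every bound queue; return how many queues got a copy."""
        delivered = 0
        for queue_name in self.exchanges[exchange]:
            self.queues[queue_name].append(message)
            delivered += 1
        return delivered

    def fail_over(self) -> "Broker":
        # Only the exchange names carry over to the replacement node.
        replacement = Broker()
        replacement.exchanges = {name: [] for name in self.exchanges}
        return replacement


class GuardedProducer:
    """Hypothetical producer that keeps any message no queue accepted."""

    def __init__(self, broker: Broker, exchange: str) -> None:
        self.broker = broker
        self.exchange = exchange
        self.pending: List[str] = []

    def send(self, message: str) -> None:
        if self.broker.publish(self.exchange, message) == 0:
            self.pending.append(message)  # would otherwise vanish into the ether

    def retry(self) -> None:
        waiting, self.pending = self.pending, []
        for message in waiting:
            self.send(message)


primary = Broker()
primary.declare_exchange("orders")
primary.declare_queue("billing", "orders")

backup = primary.fail_over()  # the exchange survives, the queue does not
producer = GuardedProducer(backup, "orders")
producer.send("order-42")     # routes to no queue, so it is held locally
print(producer.pending)       # ['order-42']

backup.declare_queue("billing", "orders")  # the consumer re-attaches
producer.retry()
print(list(backup.queues["billing"]))      # ['order-42']
```

Where the producer buffers, and for how long, remains an application decision, in keeping with pushing such guarantees up out of the broker core.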

> Regarding RAS, RAS = Reliability, Availability, Serviceability:
> http://en.wikipedia.org/wiki/Reliability,_Availability_and_Serviceability

Good to know what TLAs Rabbit is being benchmarked against - in light
of the recent discussion on what a transaction actually is, you could
even suggest this definition to the working group :-)

Ben