[rabbitmq-discuss] Weird Crash - Recovery logic for durable messages/queues/exchanges?

Fri Aug 7 19:20:51 BST 2009

Darien,

Darien Kindlund wrote:
> Understood.  Yes, I've checked and there were no
> connection/re-connection attempts when I witnessed that the
> persistent messages in the durable queue were still marked as
> un-ack'd.

And you are sure that the server actually restarted?

Also, how easy is it to reproduce this problem? Does it happen with,
say, a clean installation (empty db dir, no log files) when you publish
a few persistent messages, consume (but not ack) them, and then restart
the broker?

If you can construct such a test case for us then we'll try to replicate
it on our systems.

> Okay, so 'queue.purge' will flush all 'ready' and 'un-ack'd' messages
> from a particular queue -- or just un-ack'd messages?

Good question :) In AMQP 0-8 the spec requires that queue.purge removes
*all* messages. In 0-9-1 this got changed to the more sensible "all
messages not awaiting acknowledgement", and that is what RabbitMQ
implements.
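
To make the difference concrete, here is a toy Python model of the two
purge behaviours. It is only a sketch: ToyQueue and its fields are made
up for illustration and have nothing to do with RabbitMQ's actual
(Erlang) implementation.

```python
from collections import deque


class ToyQueue:
    """Hypothetical model: 'ready' messages plus messages awaiting an ack."""

    def __init__(self):
        self.ready = deque()   # messages not yet delivered to any consumer
        self.unacked = {}      # delivery tag -> message awaiting acknowledgement
        self._next_tag = 1

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        """Hand the oldest ready message to a consumer; it now awaits an ack."""
        msg = self.ready.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self.unacked[tag] = msg
        return tag, msg

    def purge_0_8(self):
        """AMQP 0-8 wording: remove *all* messages. Returns the number removed."""
        removed = len(self.ready) + len(self.unacked)
        self.ready.clear()
        self.unacked.clear()
        return removed

    def purge_0_9_1(self):
        """AMQP 0-9-1 wording: remove only messages not awaiting acknowledgement."""
        removed = len(self.ready)
        self.ready.clear()
        return removed
```

With three messages published and one delivered, purge_0_9_1() in this
model removes the two ready messages and leaves the delivered one
waiting for its ack, whereas purge_0_8() would remove all three.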

> Is there a command in the AMQP spec that will instruct RabbitMQ to re-mark all
> un-ack'd messages as ready?

There have been several discussions about this on the mailing list.
AMQP has a 'basic.reject' command which, if RabbitMQ implemented it (it
doesn't, yet), would allow a client to reject specific messages it has
received, and thus make them available to other consumers.

But that falls short of what you are after, since you want some agent
other than the consumer to initiate the reclaim of messages.

> If no such command exists, I'm thinking it would be useful to include
> such a command in future versions of the spec, so that people could
> develop 'message recovery logic', when dealing with buggy consumers
> that are connected but are not actually properly processing the
> messages.

Do you want to have a stab at defining the syntax and semantics of this
new command? Take a look at the 0-9-1 xml spec
(http://jira.amqp.org/confluence/download/attachments/720900/amqp0-9-1.xml),
to get an idea of the flavour in which AMQP commands are defined, and
follow that as closely as possible.

I'd be happy to review and discuss your proposal.

One issue you are going to have to think about is what to do with
acknowledgments sent by the original consumer for a message that has
been "reclaimed". Do we treat the message as ack'ed at that point? Or
not? Should the ack fail (as it would if a consumer tried to ack a
message it didn't receive)? If we treat the message as ack'ed, what then
happens when another consumer to which the reclaimed message was sent
tries to ack it?

> If I run into this issue in the future, would it help if I could
> provide you a copy of the mnesia directory once RabbitMQ has
> unexpectedly crashed?

Possibly. Though looking at the code I am struggling to see how mnesia
or the persister could have anything to do with the behaviour you are
observing. You see, messages don't actually carry an 'unacknowledged'
mark. Instead when a message is sent to a consumer it is moved to a
different part of the state of the queue process, associated with that
particular consumer. rabbitmqctl quite literally counts the messages in
all these so-called consumer records to determine the unack'ed message
count. The consumer records do not survive a server restart. So for the
count to return a non-zero number after a restart the queue process must
have created some consumer records, which means some consumers must have
connected. Now, there could of course be all kinds of bugs lurking in
the code, including the queue processes conjuring up consumer records
from thin air, but atm I cannot see how a bug in the persistence
mechanism would result in the behaviour you have described.
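
The reasoning above can be sketched as a toy Python model. This is not
the real queue process: ToyQueueProcess, its field names and restart()
are hypothetical, and the model assumes that persistent messages,
including those that were delivered but not ack'ed, reappear as ready
after a restart.

```python
class ToyQueueProcess:
    """Hypothetical model of a queue process's in-memory state."""

    def __init__(self):
        self.ready = []             # messages not yet sent to a consumer
        self.consumer_records = {}  # consumer id -> delivered, un-ack'ed messages

    def publish(self, body, persistent=True):
        self.ready.append({"body": body, "persistent": persistent})

    def deliver(self, consumer):
        """Move the oldest ready message into that consumer's record."""
        msg = self.ready.pop(0)
        self.consumer_records.setdefault(consumer, []).append(msg)
        return msg

    def unacked_count(self):
        """Count un-ack'ed messages by summing over all consumer records."""
        return sum(len(msgs) for msgs in self.consumer_records.values())

    def restart(self):
        """Return the state after a restart, as assumed by this model.

        Consumer records are not carried over. Persistent messages come
        back as ready (assumption); transient ones are lost.
        """
        pending = [m for msgs in self.consumer_records.values() for m in msgs]
        fresh = ToyQueueProcess()
        fresh.ready = [m for m in pending + self.ready if m["persistent"]]
        return fresh
```

In this model unacked_count() on the restarted state is always zero
until deliver() is called again, i.e. until some consumer has connected
and been sent messages.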

Regards,

Matthias