[rabbitmq-discuss] Database Corruption Possibilities

Matthew Sackman matthew at rabbitmq.com
Wed Jun 15 12:23:42 BST 2011


On Fri, Jun 03, 2011 at 12:21:07PM +0100, Ozan Seymen wrote:
> Can someone please explain the scenarios where we might have Rabbit
> message storage (is it mnesia?) corrupted in a way that it is not
> recoverable?

As stated elsewhere, Rabbit does not store messages in mnesia. That
doesn't mean that mnesia being damaged won't cause problems - if you
take a hex editor to its files, I'm sure you can do much damage. Whether
or not mnesia will start up after that, I'm not sure.

Rabbit's own on-disk format is very simple, and recovery is done on a
best-effort basis. Rabbit is able to fsck its queues and msg stores if
necessary on startup to ensure that messages are where they're expected
to be, and Rabbit will throw away corrupted files as it finds them. This
may result, in extremis, in Rabbit being able to recover no messages
from disk, but it'll still start up happily. It doesn't care too much
that it can't read messages from disk; it'll just tidy up and get on
with life.

> In the solution I am working on, I simply cannot afford to lose any
> messages. In order to secure this, I will:
> 
> * Rely on publisher confirms. This should ensure that broker
> will always confirm whether it assumed responsibility and persisted
> the message.

Yes-ish. If the message is persistent and sent to a durable queue, and
the message isn't consumed (and ack'd) before it reaches disk, then the
confirm will indicate it's been fsync'd to disk.
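
As a rough illustration of what relying on confirms means for the
publisher: it has to keep its own copy of each message until the broker
confirms it, and re-publish whatever is still outstanding if the
connection drops. The sketch below only does that bookkeeping. It
assumes the usual confirm-mode behaviour, where publishes on a channel
are numbered from 1 and a confirm with the "multiple" flag covers every
sequence number up to and including its delivery tag. The class and
method names are hypothetical, and no broker connection is shown.

```python
class ConfirmTracker:
    """Publisher-side record of messages the broker has not yet confirmed."""

    def __init__(self):
        self.next_seq = 1        # sequence number of the next publish
        self.unconfirmed = {}    # seq -> message body still awaiting a confirm

    def published(self, body):
        """Record a message that was just handed to the broker."""
        seq = self.next_seq
        self.unconfirmed[seq] = body
        self.next_seq += 1
        return seq

    def on_ack(self, delivery_tag, multiple=False):
        """Forget messages covered by a confirm from the broker."""
        if multiple:
            # One confirm covers everything up to and including delivery_tag.
            for seq in [s for s in self.unconfirmed if s <= delivery_tag]:
                del self.unconfirmed[seq]
        else:
            self.unconfirmed.pop(delivery_tag, None)

    def pending(self):
        """Messages to re-publish, oldest first, e.g. after a reconnect."""
        return [self.unconfirmed[s] for s in sorted(self.unconfirmed)]
```

Re-publishing the pending messages can deliver the same message twice,
so this only makes sense together with duplicate detection on the
consumer side.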

> * Ack enabled in consumers to prevent losing messages if
> consumer dies halfway. I will solve the ordering problem on the
> consumer side.

Ok. It's not just an ordering problem though; you also will need to
detect duplicates.
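
A minimal sketch of duplicate detection in a consumer follows. It
assumes the publisher stamps every message with a unique id (for
example in the AMQP message-id property) and that remembering a bounded
window of recent ids is enough. The class name, the window size and the
handler callback are all hypothetical, and a real consumer would
probably need to keep the seen ids somewhere that survives a restart.

```python
from collections import OrderedDict


class DedupingConsumer:
    """Runs the handler at most once per message id within a bounded window."""

    def __init__(self, handler, window=10000):
        self.handler = handler
        self.window = window
        self.seen = OrderedDict()   # message ids, oldest first

    def on_message(self, message_id, body):
        """Return True if the message was handled, False if it was a duplicate.

        Either way the caller can ack the delivery afterwards.
        """
        if message_id in self.seen:
            return False
        self.handler(body)
        # Record the id only after the handler succeeded, so that a crash
        # inside the handler leaves the redelivered copy eligible.
        self.seen[message_id] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # forget the oldest id
        return True
```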

> Even though all of these above prevent message loss in normal
> conditions, none of them covers the case where data gets corrupted in
> the broker.

Indeed not. There are dozens of other places where messages can get
corrupted too: faulty RAM, dying hard discs, man-in-the-middle attacks
that rewrite your messages on the fly, and so on.
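
One way to at least detect this kind of corruption end to end,
independently of the broker, is for the publisher to attach a keyed
checksum that the consumer verifies. The sketch below uses HMAC-SHA256
from the Python standard library. The function names and the
tag-then-body layout are assumptions made for illustration, and how the
shared key is distributed is left out. It detects damage rather than
preventing it, and it can't help if the body was already corrupted
before it was sealed.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(key: bytes, body: bytes) -> bytes:
    """Prefix the body with an HMAC-SHA256 tag before publishing."""
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def unseal(key: bytes, sealed: bytes) -> bytes:
    """Verify the tag on a received message and return the original body."""
    if len(sealed) < TAG_LEN:
        raise ValueError("message too short to carry a tag")
    tag, body = sealed[:TAG_LEN], sealed[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message failed integrity check")
    return body
```

A consumer that gets a ValueError here knows the message was damaged
somewhere between publisher and consumer, and can reject it or ask for
it again rather than act on bad data.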

> There is a window (albeit small) that things might go
> wrong: broker assumes responsibility (message is in the disk) and
> before message is sent to the consumer, broker experiences problems
> which corrupts the storage.

This is remarkably unlikely, and whilst I could argue long and hard
about how the design of the persister should prevent this from ever
happening, I've not proved the correctness of the kernel disk drivers,
file system, or even Erlang's own file driver. Thus there could be, and
probably are, bugs in all of those that could cause such corruption.

> Am I a total paranoid that is beyond help? Even so, I would really
> appreciate any info you guys can share.

I don't think you're paranoid, but all (well, nearly all) reasonably
complex software has bugs. Maybe consider taking out insurance against
message loss? Otherwise, IMO, it's far better to build your systems with
message loss in mind, rather than trying to avoid it.


Best wishes,

Matthew

