[rabbitmq-discuss] HA - missing or incompletely replicated queues

Tue Nov 8 00:24:54 GMT 2011

Hi Ashley,

On Mon, Nov 07, 2011 at 11:46:26PM +0000, Ashley Brown wrote:
> Effectively:
> 
> 5 HA queues, replicated to all nodes. 3 nodes in the cluster, 15GB
> machines, high water mark at 5.9GB.
> 
> Start producers and consumers, consumers running slowly, allowing queues to
> build to 500k.
> 
> Stop producers and consumers, delete queues. Deletions take a long time.
> Although I've also seen this with no-HA queues - it can take tens of
> minutes to delete a queue with 250k+ messages in.

Ok, if it's that simple to trigger then it shouldn't take too long for
me to reproduce it. I've certainly seen the slaves take some time to
catch up. The master->slave communication is deliberately asynchronous,
so that the slowest slave doesn't dictate the pace, and that can create
such backlogs. We've introduced shortcuts for queue deletion elsewhere,
though, and I'd imagine we could do the same in a few places in the HA
queue code.
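
As a rough illustration of how asynchronous master->slave replication
can build up a backlog, here is a toy Python model. It is only a sketch
and not RabbitMQ's actual (Erlang) implementation: the Master and Slave
classes, the per-tick rates and the simulate() helper are all
hypothetical names invented for this example.

```python
from collections import deque


class Slave:
    """Hypothetical mirror that applies replicated operations from a mailbox."""

    def __init__(self, ops_per_tick):
        self.ops_per_tick = ops_per_tick
        self.mailbox = deque()  # operations sent by the master, not yet applied
        self.applied = 0

    def tick(self):
        # Apply at most ops_per_tick pending operations.
        for _ in range(min(self.ops_per_tick, len(self.mailbox))):
            self.mailbox.popleft()
            self.applied += 1


class Master:
    """Hypothetical master that replicates without waiting for the slaves."""

    def __init__(self, slaves):
        self.slaves = slaves
        self.published = 0

    def publish(self, msg):
        # Fire-and-forget: hand the operation to each slave and return at once,
        # so a slow slave never holds up the master.
        for slave in self.slaves:
            slave.mailbox.append(("publish", msg))
        self.published += 1


def simulate(ticks, publishes_per_tick, slave_rates):
    slaves = [Slave(rate) for rate in slave_rates]
    master = Master(slaves)
    for t in range(ticks):
        for i in range(publishes_per_tick):
            master.publish((t, i))
        for slave in slaves:
            slave.tick()
    return master.published, [len(s.mailbox) for s in slaves]


published, backlogs = simulate(ticks=100, publishes_per_tick=10,
                               slave_rates=[10, 4])
print(published, backlogs)  # 1000 [0, 600]
```

In this model the slave that keeps pace has nothing pending, while the
slower one still has 600 operations queued. Anything sent afterwards,
such as a delete, sits behind that backlog unless it is given a
shortcut.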

> Previously we've had queues in a steady state, with approximately 25,000
> unacked messages (they take several minutes to process, aren't acked until
> complete). Then kill some nodes off, forcing the messages to be requeued
> and replayed on the slaves.

Argh. That number of unacked messages might be a problem. I'm not sure
at this stage, but it's possible there are some places where we're just
using the wrong data structures to deal with that sort of scenario. I'll
take a look after a snooze.
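
To illustrate in general terms why the choice of data structure can
matter with many unacked messages, here is a small Python sketch. It
does not describe what RabbitMQ actually does internally; the two
functions and the tag/payload layout are assumptions made up for the
example.

```python
def ack_from_list(unacked, tag):
    """Ack by scanning a list of (tag, message) pairs.

    Returns the number of entries inspected.
    """
    for i, (t, _msg) in enumerate(unacked):
        if t == tag:
            del unacked[i]
            return i + 1
    raise KeyError(tag)


def ack_from_dict(unacked, tag):
    """Ack by direct lookup in a dict keyed by tag; one lookup per ack."""
    del unacked[tag]
    return 1


n = 1000  # kept small so it runs instantly; the thread mentions ~25,000
tags = list(range(n))
as_list = [(t, b"payload") for t in tags]
as_dict = {t: b"payload" for t in tags}

# Ack newest-first, the worst case for a front-to-back list scan.
list_steps = sum(ack_from_list(as_list, t) for t in reversed(tags))
dict_steps = sum(ack_from_dict(as_dict, t) for t in reversed(tags))
print(list_steps, dict_steps)  # 500500 1000
```

The list scan costs n(n+1)/2 inspections in this worst case, so at
25,000 unacked messages that would be 312,512,500 inspections against
25,000 keyed lookups.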

> It all gets out of sync after that.

Well, it shouldn't. I still suspect it might eventually recover, but it
might take the heat-death of the universe before it gets there.

> I might be able to give you a better test case once we've pushed our non-HA
> release out and I have a bit more time.

That'd be very kind, but I'll prod it myself in the meantime.

Thanks,

Matthew