[rabbitmq-discuss] HA active/active cluster in a bad state

Matthew Sackman matthew at rabbitmq.com
Thu Oct 13 16:33:29 BST 2011


Hi Bryan,

Sorry it's taken so long to get back to you... been swamped with other
things.

On Tue, Oct 04, 2011 at 08:10:09PM -0500, Bryan Murphy wrote:
> From another node:
> 
> Cluster status of node 'rabbit@domU-12-31-39-06-72-50' ...
> [{nodes,[{disc,['rabbit@domU-12-31-38-07-18-A6','rabbit@ip-10-202-209-83',
>                 'rabbit@domU-12-31-39-06-72-50']}]},
>  {running_nodes,['rabbit@domU-12-31-38-07-18-A6','rabbit@ip-10-202-209-83',
>                  'rabbit@domU-12-31-39-06-72-50']}]
> ...done.
> 
> rabbitmqctl list_queues has the same behavior on the other nodes (never
> returns).

Ok, that most likely suggests the master queue is stuck for some reason,
which is remarkably odd.
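
To tell a genuine hang apart from an outright failure on each node, the
command can be run under a timeout. What follows is only a rough sketch,
assuming Python 3.7+ is available on the machines; the `probe` helper and
the 30-second limit are invented for illustration and are not part of
RabbitMQ. The node names are whatever you pass on the command line.

```python
import subprocess
import sys


def probe(cmd, timeout_s=30.0):
    """Run cmd and classify the result.

    Returns ("ok", stdout), ("failed", error text) or ("timeout", "").
    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run once the timeout expires.
        return ("timeout", "")
    except FileNotFoundError as exc:
        # The executable is not on PATH.
        return ("failed", str(exc))
    if result.returncode != 0:
        return ("failed", result.stderr)
    return ("ok", result.stdout)


if __name__ == "__main__":
    # Usage: python probe.py rabbit@host1 rabbit@host2 ...
    for node in sys.argv[1:]:
        status, output = probe(["rabbitmqctl", "-n", node, "list_queues"])
        print(node, status)
        if status == "ok":
            print(output)
```

A "timeout" on every node, with "ok" from `rabbitmqctl status` on the same
nodes, would point at a stuck queue process rather than at a node that is
down.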

> > How big were the queues? We recently fixed some bugs which had
> > previously been causing queue recovery to take a _very_ long time so it
> > might be one of those that's afflicting you. What is the CPU/disk doing
> > on the "stuck" node? If it's spinning then it's probably just taking a
> > very long time to recover.
> 
> Maybe 10-20 queues, probably about 15 messages queued at the time.  This
> environment is a very *low* volume but very *critical* part of our
> application.  I'd be surprised if the production servers saw more than a
> couple hundred messages total per day and this was our test environment.
> 
> Right now the node is idle and it's been sitting there for four hours:
> 
>  01:08:33 up 8 days,  5:33,  1 user,  load average: 0.00, 0.01, 0.05
> 
> A bunch of messages were in an allocated but not-acked state as I was
> testing some of our server processes at the time and they were crashing
> before they could ack the messages.  I originally went to restart the node
> to try and get those messages flowing again.

Right, there's nothing there that could possibly offer any explanation
as to why this has gone wrong. How repeatable is this, and is there a
simple set of steps that would allow us to recreate this? Failing that,
if you can repeat this, is there any way you could grant me (or one of
my colleagues) access to the machines running Rabbit so that we can
probe your rabbits in a variety of ways?

Best wishes,

Matthew

