[rabbitmq-discuss] RabbitMQ timing out under small load

gfodor gfodor at gmail.com
Wed Jan 7 22:56:04 GMT 2009


> Do you see timeouts in rabbit.log on the instance which had many
> connections during the incident, on the instance which had almost no
> connections during the incident, or on both?

There are timeouts on both machines, unfortunately.

> How many connections does each broker have right now?
There are 3 on one and about 7 on the other. They're theoretically supposed
to be even, but in practice connections are just spread over the two of them
round robin.
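
(For what it's worth, the spreading is done on the client side, roughly
like the sketch below -- the hostnames and the Python pika client are
stand-ins here, not our actual code.)

    import itertools
    import pika

    # Client-side round robin over the two broker nodes.
    # Hostnames are placeholders, not our real ones.
    NODES = itertools.cycle(['rabbit-node-1', 'rabbit-node-2'])

    def open_connection():
        host = next(NODES)
        return pika.BlockingConnection(pika.ConnectionParameters(host=host))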

> Essentially, I think we can get some useful insight into the problem if
> we understand why one node during the incident had no connections and
> another had all of them (provided this is not how you guys usually run
> it).
I agree -- the problem is I only vaguely remember running list_connections
once or twice; the second node was always empty and the first node had 2 or
3 connections. It may or may not have been a coincidence, since both nodes
may have been dropping connections and I just happened to snapshot things
with list_connections at a point in time when the second node had none. In
hindsight I wish I had done a better job of monitoring the connections to
see whether the first node was behaving more sanely.

That said, I was definitely having connectivity problems on both of them,
with everything from running consumers to running list_queues. At the end of
the day, all I would have been able to determine is whether one node was
dropping connections less frequently than the other (i.e., node 2 may simply
have been terminating them upon connection, whereas node 1 may only have
terminated them once the connection tried to access a queue on node 2).

> Also, to confirm info from your original report: you saw the problem was
> resolved when you attached a consumer to 2 non-important queues and
> drained them by consuming all messages from them. Did you do it on the
> node with connections or node without connections, and on which nodes did
> those queues live?
Honestly I am not sure where the queues lived, since I don't think RabbitMQ
can really tell you directly. We basically created them round robin on the
machines before bringing the system up. 

I was not able to drain them by performing get operations, due to the
constant timeouts and the fact that the gets were taking 2-3 seconds apiece.
I ended up fixing it by performing a delete-queue operation and then
creating the queue over again.
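
To give a concrete idea of what I was attempting, it was essentially the
following (sketched here with the Python pika client; the host and queue
name are placeholders rather than our real setup):

    import pika

    # Placeholder host and queue name, not our real setup.
    conn = pika.BlockingConnection(pika.ConnectionParameters(host='rabbit-node-1'))
    ch = conn.channel()

    # The drain attempt: basic.get in a loop until the queue is empty.
    # During the incident each get took 2-3 seconds or timed out.
    while True:
        method, properties, body = ch.basic_get(queue='unimportant_queue', auto_ack=True)
        if method is None:
            break  # queue is empty

    # What actually resolved it: delete the queue and declare it again.
    ch.queue_delete(queue='unimportant_queue')
    ch.queue_declare(queue='unimportant_queue')

    conn.close()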

I should have documented the incident a bit more thoroughly. In retrospect
there's some semi-conclusive evidence that it was a single node breaking,
and some counter-evidence that it was a cluster-wide phenomenon. For
example, I seem to recall trying to drain the queues first (by basically
just calling get a number of times), and it seemed that if I went through
the second node I got nowhere (no get operations ever completed), but if I
went through the first node I could perform gets, albeit at a 2-3 second
pace. That makes me feel like it was one node acting weird but having an
effect on the entire cluster. Next time this happens (hopefully never) I
will try removing the apparently troublesome node from the cluster and see
if things improve.

Here's a potentially more productive question: if I get to the point where I
am sure that RabbitMQ is causing the problem and not some external TCP
factor, is there any other way to see what RabbitMQ is doing internally,
other than by looking at the log?
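
(For context, the only visibility I have at the moment is polling
rabbitmqctl from a script, roughly as below -- the exact info items
available will depend on the broker version.)

    import subprocess

    # Assumes rabbitmqctl is on PATH and run with sufficient privileges.
    for cmd in (['rabbitmqctl', 'list_connections'],
                ['rabbitmqctl', 'list_queues', 'name', 'messages_ready',
                 'messages_unacknowledged', 'consumers']):
        print(subprocess.check_output(cmd).decode())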

-- 
View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21342279.html
Sent from the RabbitMQ mailing list archive at Nabble.com.




