[rabbitmq-discuss] RabbitMQ timing out under small load

Tue Jan 6 17:58:21 GMT 2009

We had an incident yesterday while restarting some servers in which RabbitMQ
became essentially unresponsive: connection timeouts showed up throughout
our logs, and we could not even run rabbitmqctl list_queues without it
timing out 70-80% of the time.
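
For anyone trying to reproduce this, the failure rate could be measured with
something like the Python sketch below. It is only an illustration, not what
we ran at the time: the helper name failure_rate, the 5-second wall-clock
limit, and the choice to treat a non-zero exit status as a failure are all
assumptions.

```python
import subprocess


def failure_rate(cmd, attempts=10, timeout_s=5.0):
    """Run cmd `attempts` times; return the fraction of runs that failed.

    A run counts as failed if it exceeds timeout_s seconds of wall-clock
    time or exits with a non-zero status (assumed here to be how the tool
    reports its own internal timeout).
    """
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            failures += 1
    return failures / attempts


# Example call on a broker node (requires rabbitmqctl on the PATH):
#   failure_rate(["rabbitmqctl", "list_queues"], attempts=20)
```

A result around 0.7-0.8 would correspond to what we saw during the incident.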

Here's our setup. We have two RabbitMQ nodes on EC2, both running Ubuntu
hardy. They have four queues, all of them durable, each carrying messages
smaller than 1 KB. Yesterday we were doing a deployment, so naturally these
queues filled up with backlog while the servers responsible for processing
them were temporarily turned off. We ran into a snag in the deployment, so
this downtime was extended, and the queues backed up further than they ever
have in the past.

At the time of the incident, both nodes were displaying the same behavior,
though anecdotally one node appeared slightly more responsive than the
other. list_queues was timing out and our consumers (once brought back up)
were throwing connection timeout errors left and right. We also noticed
that the consumer threads that did not time out had to wait 2-3 seconds on
a synchronous get call before receiving the next message from the queue. We
were pushing approximately 30-60 items per second onto the 4 queues in
total.

At the time of the failures, the persister log was around 25 MB, and I was
able to see using list_queues that the queue sizes totalled approximately
50 MB. RAM-wise, rabbit was using approximately 30-40 MB on each machine.
There were intermittent CPU spikes, but the CPU was largely idle. (Note
that we initialize these queues in a way that ensures they are distributed
evenly over the nodes: half the queues are 'owned' by node 1 and the other
half by node 2, so the RAM usage makes sense.) According to iostat, the
disk did not seem to be under much load, and now that things are back to
normal the disk load looks the same as when things were broken (about
30 blk/s of writes).

One thing to note is that this is not a very high workload, so we are
perplexed as to why things died after our queues backed up for an hour. We
managed to fix the problem by purging the two larger queues (after several
attempts), at which point everything snapped back to normal. Fortunately
for us the data on those queues didn't need to be retained, and we were
able to salvage the data on the other two queues, which is the data we
actually need to be durable.
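
To put rough numbers on that backlog, here is a back-of-the-envelope sketch
in Python. The one-hour outage is approximate, 1024 bytes is only the upper
bound on our message size rather than a measured average, and the function
name is made up for illustration.

```python
def backlog_estimate(rate_per_s, outage_s, avg_msg_bytes):
    """Return (message_count, total_bytes) accumulated while consumers are down."""
    messages = rate_per_s * outage_s
    return messages, messages * avg_msg_bytes


# Publish rate of 30-60 messages/s, roughly one hour with no consumers,
# and the 1 KB upper bound on message size (an assumption, see above).
for rate in (30, 60):
    messages, total_bytes = backlog_estimate(rate, 3600, 1024)
    print(f"{rate}/s -> {messages} messages, {total_bytes / 2**20:.0f} MB at most")
```

That gives 108,000 to 216,000 messages, or at most roughly 105 to 211 MB
(taking 1 MB as 2**20 bytes). If the ~50 MB reported by list_queues reflects
message payloads, it is consistent with an average message size of a few
hundred bytes.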

The rabbit.log didn't reveal anything useful; it simply showed the regular
persister log rotation messages mixed in with timeout errors such as the
following:

=WARNING REPORT==== 6-Jan-2009::02:20:42 ===
Non-AMQP exit reason '{timeout,
                          {gen_server,
                              call,
                              [<8656.465.0>,{basic_get,<0.14214.1>,false}]}}'
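
To see which operations are timing out across a large rabbit.log, the
entries could be tallied with a small Python sketch like the one below. The
regular expression is an assumption based solely on the single entry shown
above; it only matches timeouts whose request is a tuple such as
{basic_get, ...}, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches: {timeout, {gen_server, call, [<pid>, {operation, ...
# Whitespace and line breaks between the parts are tolerated.
TIMEOUT_RE = re.compile(
    r"\{timeout,\s*\{gen_server,\s*call,\s*\[<[\d.]+>,\s*\{(\w+)"
)


def count_timeouts(log_text):
    """Count gen_server call timeouts in log_text, keyed by operation name."""
    return Counter(TIMEOUT_RE.findall(log_text))


sample = """=WARNING REPORT==== 6-Jan-2009::02:20:42 ===
Non-AMQP exit reason '{timeout,
                          {gen_server,
                              call,
                              [<8656.465.0>,{basic_get,<0.14214.1>,false}]}}'
"""

print(count_timeouts(sample))
```

On a full log this would show whether the timeouts are confined to basic_get
or spread across other operations as well.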

We're obviously deeply concerned by this incident, particularly since
RabbitMQ was nowhere near exhausting memory or I/O capacity. It seemed
instead to be suffering from internal contention, garbage collection
problems, or something similar that prevented it from handling our light
workload once our consumers came back online. Any help you can give is
greatly appreciated!

-- 
View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21315520.html
Sent from the RabbitMQ mailing list archive at Nabble.com.