[rabbitmq-discuss] RabbitMQ timing out under small load
Alexis Richardson
alexis.richardson at cohesiveft.com
Tue Jan 6 18:08:28 GMT 2009
Can you tell us a bit more about the EC2 setup please -- is there
anything unusual about how you built and connect to your AMIs?
alexis
On Tue, Jan 6, 2009 at 5:58 PM, gfodor <gfodor at gmail.com> wrote:
>
> We had an incident yesterday while restarting some servers where RabbitMQ
> became essentially dysfunctional -- we were getting connection timeouts
> in our logs everywhere, and could not even run a rabbitmqctl
> list_queues without it timing out 70-80% of the time.
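>
> (For reference, the depth check we kept running -- and which kept timing
> out -- was along these lines, exact columns from memory:
>
>     rabbitmqctl list_queues name messages memory
>
> and normally it returns instantly.)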
>
> Here's our setup. We have two RabbitMQ nodes on EC2, running Ubuntu
> Hardy. They have four queues, all of them durable, each with <1kb
> message sizes. Yesterday we were doing a deployment, so naturally these
> queues filled up with backlog while the servers responsible for
> processing them were temporarily turned off. We ran into a snag in the
> deployment, so this downtime was extended, which left the queues more
> backlogged than they have ever been in the past.
>
> At the time of the incident, both nodes were displaying the same
> behavior, though anecdotally one node appeared slightly more responsive
> than the other. list_queues was timing out, and our consumers (once
> brought back up) were throwing connection timeout errors left and right.
> We also noticed that the consumer threads that did not time out had to
> wait 2-3 seconds on a synchronous get before receiving the next message
> from the queue (sketched below). We were pushing approximately 30-60
> messages per second onto the 4 queues in total. At the time of the
> failures, the persister log was around 25MB, and list_queues showed that
> the queues totalled approximately 50MB. RAM-wise, Rabbit was using
> approximately 30-40MB on each machine; there were intermittent CPU
> spikes, but the CPU was largely idle. (Note that we declare these queues
> so that they are distributed evenly over the nodes -- half the queues are
> 'owned' by node 1 and the other half by node 2 -- so the RAM usage makes
> sense.) According to iostat, the disk did not seem to be under much load,
> and now that things are back to normal the disk load looks the same as it
> did when things were broken (about 30 blk/s of writes).
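>
> To clarify what I mean by the synchronous get: each consumer thread does
> a blocking basic_get poll, roughly like the sketch below (illustrative
> Python using the pika client, with a made-up queue name -- not our actual
> consumer code), and it's the round trip on basic_get that was taking 2-3
> seconds:
>
>     import time
>     import pika
>
>     conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit-node-1'))
>     channel = conn.channel()
>
>     while True:
>         start = time.time()
>         # synchronous poll: returns (None, None, None) if the queue is empty
>         method, properties, body = channel.basic_get(queue='work_queue', auto_ack=False)
>         print('basic_get round trip: %.2fs' % (time.time() - start))
>         if method is None:
>             time.sleep(0.1)   # back off briefly on an empty queue
>             continue
>         # ... process body ...
>         channel.basic_ack(method.delivery_tag)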
>
> One thing to note, obviously, is that this isn't a very high workload,
> so we are pretty perplexed as to why things died after our queues backed
> up for an hour. We managed to fix the problem by purging the two larger
> queues (after several attempts), at which point things snapped out of it.
> Fortunately for us, the data on those two queues didn't really need to be
> retained, and we were able to salvage the data on the other two queues,
> which is the data we actually need to be durable.
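>
> To be specific, by "purging" I mean an AMQP queue.purge on the two
> backed-up queues -- roughly the following (again illustrative Python/pika
> with made-up queue names, not the exact tool we used):
>
>     import pika
>
>     conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit-node-1'))
>     channel = conn.channel()
>     # queue.purge drops all ready messages from the named queue
>     channel.queue_purge(queue='backlog_queue_1')
>     channel.queue_purge(queue='backlog_queue_2')
>     conn.close()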
>
> The rabbit.log didn't reveal anything useful; it simply showed the
> regular persister log rotation messages mixed in with timeout errors
> such as the following:
>
> =WARNING REPORT==== 6-Jan-2009::02:20:42 ===
> Non-AMQP exit reason '{timeout,
>                        {gen_server,call,
>                         [<8656.465.0>,{basic_get,<0.14214.1>,false}]}}'
>
> We're obviously deeply concerned by this incident, particularly since
> RabbitMQ was nowhere near exhausting memory or I/O capacity, and it
> seemed to be suffering from internal contention, garbage collection, or
> some other issue that prevented it from handling our light workload once
> our consumers came back online. Any help you can give is greatly
> appreciated!
>
> --
> View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21315520.html
> Sent from the RabbitMQ mailing list archive at Nabble.com.
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>