[rabbitmq-discuss] RabbitMQ timing out under small load

gfodor gfodor at gmail.com
Tue Jan 6 18:16:25 GMT 2009


Not really; we're using the standard Ubuntu AMIs out of the box, and we
actually just set up the RabbitMQ brokers by hand on the machines.

We're using the 32-bit hardy image from here:

http://alestic.com/

There is one other issue we had been running into. We're on JVM 5 and
were hitting the problem outlined here:

http://www.nabble.com/TCP-timeouts-td17102781.html#a17102781

so we upped our heartbeat value to 30 seconds, which I realize is high, but
that made the problem go away. I doubt this could affect things, but in the
interest of full disclosure there it is :)
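
For reference, a rough sketch of how that heartbeat gets set, assuming the
RabbitMQ Java client (the host name here is a placeholder, and older client
versions may expose the setting through a different class):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HeartbeatSetup {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("broker.example.com");  // placeholder broker host
        // Request a 30-second heartbeat interval, matching the workaround
        // described above; 0 would disable heartbeats entirely.
        factory.setRequestedHeartbeat(30);
        Connection conn = factory.newConnection();
        // ... open channels, publish/consume as usual ...
        conn.close();
    }
}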


Alexis Richardson-2 wrote:
> 
> Can you tell us a bit more about the EC2 setup please - is there
> anything unusual about how you built and connect to your AMIs?
> 
> alexis
> 
> 
> On Tue, Jan 6, 2009 at 5:58 PM, gfodor <gfodor at gmail.com> wrote:
>>
>> We had an incident yesterday while restarting some servers where RabbitMQ
>> basically became dysfunctional -- we were getting connection timeouts in
>> our logs everywhere, and could not even run rabbitmqctl list_queues
>> without it timing out 70-80% of the time.
>>
>> Here's our setup. We have two RabbitMQ nodes on EC2 running Ubuntu hardy.
>> They host four queues, all of them durable, with message sizes under 1KB.
>> Yesterday we were doing a deployment, so naturally these queues filled up
>> with a backlog while the servers responsible for processing them were
>> temporarily turned off. We ran into a snag in the deployment, so this
>> downtime was extended, which left the queues more backed up than they
>> have ever been in the past.
>>
>> At the time of the incident, both nodes were displaying the same behavior,
>> though anecdotally one node appeared slightly more responsive than the
>> other. list_queues was timing out and our consumers (once brought back up)
>> were throwing connection timeout errors left and right. We also noticed
>> that the consumer threads that did not time out would have to wait 2-3
>> seconds on a synchronous get call before receiving the next message from
>> the queue. We were pushing approximately 30-60 items per second onto the
>> 4 queues in total. At the time of the failures, the persister log was
>> around 25MB, and list_queues showed that the queue sizes in total were
>> approximately 50MB. RAM-wise, rabbit was using approximately 30-40MB on
>> each machine, and there were intermittent CPU spikes but the CPU was
>> largely idle. (Note that we initialize these queues so that they are
>> distributed evenly over the nodes: half the queues are 'owned' by node 1
>> and the other half by node 2, so the RAM usage makes sense.) According to
>> iostat, the disk did not seem to be under much load, and now that things
>> are nominal the disk load looks the same as when things were broken
>> (about 30 blk/s writes).
>>
>> One thing to note is that this obviously isn't a very high workload, so we
>> are pretty perplexed as to why things died after our queues backed up for
>> an hour. We managed to fix the problem by purging the two larger queues
>> (after several attempts), and things snapped out of it. Fortunately for us
>> the data on those queues didn't really need to be retained, and we were
>> able to salvage the data on the other two queues, which is the data we
>> actually need to be durable.
>>
>> The rabbit.log didn't reveal anything useful; it simply showed the regular
>> persister log rotation messages mixed in with timeout errors such as the
>> following:
>>
>> =WARNING REPORT==== 6-Jan-2009::02:20:42 ===
>> Non-AMQP exit reason '{timeout,
>>                          {gen_server,
>>                              call,
>>                              [<8656.465.0>,{basic_get,<0.14214.1>,false}]}}'
>>
>> We're obviously deeply concerned by this incident, particularly since
>> RabbitMQ was nowhere near exhausting memory or I/O capacity and simply
>> seemed to be suffering from internal contention, garbage collection, or
>> something else that prevented it from handling our light workload once
>> our consumers came back online. Any help you can give is greatly
>> appreciated!
>>
> 

-- 
View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21315877.html
Sent from the RabbitMQ mailing list archive at Nabble.com.




