[rabbitmq-discuss] RabbitMQ timing out under small load

Alexis Richardson alexis.richardson at cohesiveft.com
Tue Jan 6 18:35:18 GMT 2009


Thanks!

I suspect a funky mix of EC2, TCP, latency, and JVM5 is to blame,
manifesting as non-deterministic behaviour in your Rabbit cluster.

One of the guys on this list is an EC2 expert and I hope he will
comment before we dig deeper into the Rabbit side...


On Tue, Jan 6, 2009 at 6:16 PM, gfodor <gfodor at gmail.com> wrote:
>
> Not really; we're using the standard Ubuntu AMIs out of the box, and we
> actually just set up the RabbitMQ brokers by hand on the machines.
>
> We're using the 32-bit hardy image from here:
>
> http://alestic.com/
>
> There is one other issue we had run into. We're on JVM5, and we were
> hitting the problem outlined here:
>
> http://www.nabble.com/TCP-timeouts-td17102781.html#a17102781
>
> so we upped our heartbeat value to 30 seconds, which I realize is high, but
> the problem went away. I doubt this could affect things, but in the
> interest of full disclosure, there it is :)
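>
> For reference, here is a minimal sketch of how that heartbeat can be set
> from the Java client (assuming a client version that exposes
> ConnectionFactory#setRequestedHeartbeat; the host name is just a
> placeholder, not our actual setup):
>
>     import com.rabbitmq.client.Connection;
>     import com.rabbitmq.client.ConnectionFactory;
>
>     public class HeartbeatExample {
>         public static void main(String[] args) throws Exception {
>             ConnectionFactory factory = new ConnectionFactory();
>             factory.setHost("broker-host");     // placeholder broker address
>             factory.setRequestedHeartbeat(30);  // 30-second heartbeat, as above
>             Connection conn = factory.newConnection();
>             // ... open channels, publish and consume as usual ...
>             conn.close();
>         }
>     }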
>
>
> Alexis Richardson-2 wrote:
>>
>> Can you tell us a bit more about the EC2 set up please - is there
>> anything unusual about how you built and connect to your AMIs?
>>
>> alexis
>>
>>
>> On Tue, Jan 6, 2009 at 5:58 PM, gfodor <gfodor at gmail.com> wrote:
>>>
>>> We had an incident yesterday while restarting some servers where RabbitMQ
>>> basically became dysfunctional -- we were getting connection timeouts in
>>> our logs everywhere, and we could not even run rabbitmqctl list_queues
>>> without it timing out 70-80% of the time.
>>>
>>> Here's our setup. We have two RabbitMQ nodes on EC2 running Ubuntu hardy.
>>> They have four queues, all of them durable, each with <1kb message sizes.
>>> Yesterday we were doing a deployment, so naturally these queues filled up
>>> with backlog as the servers responsible for processing them were
>>> temporarily turned off. We ran into a snag in the deployment, so this
>>> downtime was extended, which resulted in the queues backlogging more
>>> deeply than they ever have in the past.
>>>
>>> At the time of the incident, both nodes were displaying the same behavior,
>>> though anecdotally it appeared that one node was slightly more responsive
>>> than the other. list_queues was timing out, and our consumers (once
>>> brought back up) were throwing connection timeout errors left and right.
>>> We also noticed that the consumer threads that did not time out would in
>>> fact have to wait 2-3 seconds on a synchronous get call before receiving
>>> the next message from the queue. We were pushing approximately 30-60 items
>>> per second onto the 4 queues in total. At the time of the failures, the
>>> persister log was around 25MB, and I was able to see using list_queues
>>> that the queue sizes in total were approximately 50MB. RAM-wise, rabbit
>>> was using approximately 30-40MB on each machine, and there were
>>> intermittent CPU spikes, but generally the CPU was largely idle. (Note
>>> that we initialize these queues in a manner that ensures they are
>>> distributed evenly over the nodes: half the queues are 'owned' by node 1
>>> and the other half by node 2, so the RAM usage makes sense.) According to
>>> iostat, the disk did not seem to be under much load, and now that things
>>> are nominal it looks like the disk load is the same as when things were
>>> broken (about 30 blk/s writes).
>>>
>>> One thing to notice, obviously, is that this isn't a very high workload,
>>> so we are pretty perplexed as to why things basically died after our
>>> queues backed up for an hour. We managed to fix the problem by purging the
>>> two larger queues (after several attempts), and things snapped out of it.
>>> Fortunately for us, the data on those queues didn't really need to be
>>> retained, and we were able to salvage the data on the other two queues,
>>> which is the data we actually need to be durable.
>>>
>>> The rabbit.log didn't reveal anything useful; it simply showed the regular
>>> persister log rotation messages mixed in with timeout errors such as the
>>> following:
>>>
>>> =WARNING REPORT==== 6-Jan-2009::02:20:42 ===
>>> Non-AMQP exit reason '{timeout,
>>>                           {gen_server,
>>>                               call,
>>>                               [<8656.465.0>,{basic_get,<0.14214.1>,false}]}}'
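>>>
>>> For context, the basic_get in that error corresponds to the synchronous
>>> get call mentioned above. A minimal illustrative sketch with the Java
>>> client (queue name and connection setup are placeholders, not our actual
>>> consumer code) looks roughly like this:
>>>
>>>     import com.rabbitmq.client.Channel;
>>>     import com.rabbitmq.client.Connection;
>>>     import com.rabbitmq.client.ConnectionFactory;
>>>     import com.rabbitmq.client.GetResponse;
>>>
>>>     public class SyncGetExample {
>>>         public static void main(String[] args) throws Exception {
>>>             ConnectionFactory factory = new ConnectionFactory();
>>>             factory.setHost("broker-host");  // placeholder broker address
>>>             Connection conn = factory.newConnection();
>>>             Channel channel = conn.createChannel();
>>>             // Synchronous poll: each call blocks on a round trip to the
>>>             // broker, which is the call that was taking 2-3 seconds and
>>>             // timing out during the incident.
>>>             GetResponse response = channel.basicGet("some.queue", false);
>>>             if (response != null) {
>>>                 byte[] body = response.getBody();
>>>                 // ... process body ...
>>>                 channel.basicAck(response.getEnvelope().getDeliveryTag(),
>>>                                  false);
>>>             }
>>>             channel.close();
>>>             conn.close();
>>>         }
>>>     }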
>>>
>>> We're obviously deeply concerned by this incident, particularly since
>>> RabbitMQ was nowhere near exhausting memory or I/O capacity, and it simply
>>> seemed to be suffering from internal contention, garbage collection, or
>>> something else that prevented it from handling our light workload once our
>>> consumers came back online. Any help you can give is greatly appreciated!
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21315520.html
>>> Sent from the RabbitMQ mailing list archive at Nabble.com.
>>>
>>>
>>
>
> --
> View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21315877.html
> Sent from the RabbitMQ mailing list archive at Nabble.com.
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>



