[rabbitmq-discuss] RabbitMQ timing out under small load
dmitriy.samovskiy at cohesiveft.com
Tue Jan 6 19:35:24 GMT 2009
Please note that I personally haven't observed the symptoms you describe, so I can't
speak from practical experience, only theory.
> We had an incident yesterday while restarting some servers where RabbitMQ
> began basically being dysfunctional -- we were getting connection timeouts
> in our log everywhere, and were not even able to perform a rabbitmqctl
> list_queues without it timing out 70-80% of the time.
Was there anything to suggest overall network latency, connection timeouts, or connection
issues to other services running on the hosts where rabbitmq was running? In other words,
is it possible that the issues were related to the host and/or OS?
Also, to confirm: you are running 2 rabbitmq brokers on 2 different EC2 instances (1
broker per instance), the problem happened on both instances and both brokers at the same
time, and the brokers are not in a rabbitmq cluster, right?
> Here's our setup. We have two RabbitMQ nodes running on EC2 running Ubuntu
> hardy. They have four queues, all of them durable, each with <1kb message
> sizes. Yesterday we were doing a deployment so naturally these queues filled
> up with backlog as the servers responsible for processing them were
> temporarily turned off. We ran into a snag in the deployment, so this
> downtime was extended, which resulted in the queues backlogging more than
> we've had to do in the past.
> At the time of the incident, both nodes were displaying the same behavior,
> though anecdotally it appeared that one node was slightly more responsive
> than the other. list_queues was timing out and our consumers (once brought
> back up) were throwing connection timeout errors left and right. We also
Can you replay your deployment scenario (maybe not in production, of course) and check
whether you get the same problem? Alternatively, if I were to recreate an edge case of
your scenario, I assume I'd need N producers sending messages of size=1K to the same queue
(say amq.direct exchange, routing_key=foo) at an aggregate rate of 60 messages per second
for 1 hour without consuming, and the expected result is that the broker enters the bad state?
What value corresponds to N in your situation?
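In code, the edge case I have in mind would look roughly like this (a rough sketch
assuming the Python pika client; N_PRODUCERS is a placeholder for the open question above,
and the host name is illustrative):

```python
import time

# Parameters of the scenario sketched above. N_PRODUCERS is a guess --
# the real value of N is exactly what I'm asking about.
N_PRODUCERS = 4            # placeholder, not from the original setup
AGGREGATE_RATE = 60        # messages per second across all producers
DURATION_SECS = 3600       # one hour of backlog, no consumers attached
BODY = b"x" * 1024         # ~1 KB payload

def per_producer_interval(n_producers=N_PRODUCERS, aggregate=AGGREGATE_RATE):
    # Seconds to sleep between publishes so that n producers together
    # reach the aggregate rate: each producer sends aggregate/n msgs/sec.
    return float(n_producers) / aggregate

def run_producer(host="localhost"):
    # Requires a reachable broker and the `pika` client library.
    import pika
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    ch = conn.channel()
    ch.queue_declare(queue="foo", durable=True)
    ch.queue_bind(queue="foo", exchange="amq.direct", routing_key="foo")
    interval = per_producer_interval()
    deadline = time.time() + DURATION_SECS
    while time.time() < deadline:
        ch.basic_publish(
            exchange="amq.direct", routing_key="foo", body=BODY,
            properties=pika.BasicProperties(delivery_mode=2))  # persistent
        time.sleep(interval)
    conn.close()
```

Running N_PRODUCERS copies of run_producer in parallel, with no consumers attached, should
approximate the backlog you described.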
Which version of rabbitmq were you running, and on which erlang?
Did you have a chance to capture any system runtime info (netstat, tcpdump, etc.) while
this was happening?
> One thing to notice obviously is this isn't a very high workload, so we are
> pretty perplexed why things basically died after our queue backed up for an
> hour. We managed to fix the problem by basically purging the two larger
> queues (after several attempts), and things basically snapped out of it.
So you drained the queues by attaching a consumer to them? Did you restart the brokers?
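To be clear about what I mean by draining with a consumer (as opposed to purging), a
minimal sketch, again assuming the Python pika client and placeholder queue names:

```python
def drain_queue(ch, queue):
    # Pop messages one at a time until the queue is empty;
    # returns the number of messages drained.
    drained = 0
    while True:
        method, _props, _body = ch.basic_get(queue=queue, auto_ack=True)
        if method is None:   # broker returned basic.get-empty
            return drained
        drained += 1

def drain_all(host="localhost", queues=()):
    # Requires a reachable broker and the `pika` client library.
    import pika
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    ch = conn.channel()
    counts = {q: drain_queue(ch, q) for q in queues}
    conn.close()
    return counts
```

By contrast, queue_purge discards the backlog on the broker without any messages ever
reaching a consumer.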
I'm also wondering whether anybody on the list has observed something similar, not necessarily on EC2.