[rabbitmq-discuss] RabbitMQ timing out under small load

gfodor gfodor at gmail.com
Tue Jan 6 20:21:56 GMT 2009


Hey Dmitriy,

> Was there anything to suggest overall network latency or connection
> timeouts or connection issues to other services running on the hosts
> where rabbitmq was running? In other words, is it possible that the
> issues were related to the host and/or OS?

No, I was ssh'ed in and there was no odd latency on my connection.
Additionally, the rabbitmqctl list_queues command was timing out, so I'm
pretty sure it wasn't some external network latency. Beyond that, as soon as
I purged the larger queues the problem fixed itself so it definitely seemed
coupled with the backlogged queues. To purge the queues I basically just
deleted them and re-created them using the Java client API.

> Also, to confirm, you are running 2 rabbitmq brokers on 2 different EC2
> instances (1 broker per instance) and the problem happened on both
> instances, both brokers at the same time and brokers are not in a
> rabbitmq cluster, right?

Yes, they are on two instances but they are in fact clustered. Now that you
mention it, I remember that one of the nodes seemed to have no connections
to it while the other had a handful -- I also noticed that one node was more
responsive than the other. I'm wondering if perhaps it was a node-localized
issue but the effects were propagated due to the clustering?

> Can you replay your deployment scenario (maybe not in production, of
> course) and check if you get the same problem? Alternatively, if I were
> to recreate an edge case of your scenario, I assume I'll need N producers
> sending messages with size=1K to the same queue (say amq.direct exchange,
> routing_key=foo) at aggregate rate 60 messages per second for 1 hour
> without consuming, and the expected result is that broker will enter the
> bad state? What value corresponds to N in your situation?

The numbers at the time of the failure were that we had approximately 100k
items in aggregate in the queues and the total amount of data on disk was
50MB across both nodes, so the average message size was probably between 512
bytes and 1024 bytes. There were 5 producers running at the time. Once the
problems started occurring we noticed that the producers were still getting
messages through (the client library we wrote does a retry with a buffer of
pending messages) but they were trickling in at a much lower rate than
normal (5-10 tps instead of 40-60).
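
For anyone trying to reproduce this, here is a minimal sketch in Java of
what "retry with a buffer of pending messages" can look like. It is not the
actual client library from this thread: the names Sender, RetryingPublisher
and pendingCount, the capacity bound, and the oldest-first retry order are
all assumptions made for illustration, and the broker is simulated rather
than reached through the RabbitMQ client.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical transport hook: stands in for whatever call really publishes
// to the broker (it is NOT part of the RabbitMQ Java client).
interface Sender {
    void send(String message) throws IOException;
}

// Buffers messages that could not be sent and retries them, oldest first.
// Single-threaded sketch: no locking, so not safe for concurrent producers.
class RetryingPublisher {
    private final Sender sender;
    private final int capacity; // assumed upper bound on buffered messages
    private final Deque<String> pending = new ArrayDeque<>();

    RetryingPublisher(Sender sender, int capacity) {
        this.sender = sender;
        this.capacity = capacity;
    }

    // Buffer the message, then try to drain the buffer.
    // Returns false (message rejected) if the buffer is still full.
    boolean publish(String message) {
        if (pending.size() >= capacity) {
            flush(); // one more attempt to make room
            if (pending.size() >= capacity) {
                return false;
            }
        }
        pending.addLast(message);
        flush();
        return true;
    }

    // Send buffered messages in order; stop at the first failure and keep
    // the unsent ones for the next attempt. Returns how many were sent.
    int flush() {
        int sent = 0;
        while (!pending.isEmpty()) {
            try {
                sender.send(pending.peekFirst());
            } catch (IOException e) {
                break; // broker slow or unreachable: retry on a later call
            }
            pending.removeFirst();
            sent++;
        }
        return sent;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        boolean[] brokerUp = {false}; // simulated broker state
        RetryingPublisher p = new RetryingPublisher(m -> {
            if (!brokerUp[0]) {
                throw new IOException("simulated timeout");
            }
            delivered.add(m);
        }, 1000);

        p.publish("m1");
        p.publish("m2");
        System.out.println("pending while broker is down: " + p.pendingCount());

        brokerUp[0] = true;
        p.flush();
        System.out.println("delivered after recovery: " + delivered);
    }
}
```

With a scheme like this a producer keeps "getting messages through" while
the broker is slow, only at a reduced rate, because each message waits
behind the ones that previously failed.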

> Which version of rabbitmq were you running and on which erlang?

1.5.0/R11B

> Did you have a chance to capture any system runtime info (netstat,
> tcpdump, etc) while this was happening?

Unfortunately not, I'm not a sysadmin wiz so I'm not positive how to grab
this type of TCP info.

Thanks again!
-- 
View this message in context: http://www.nabble.com/RabbitMQ-timing-out-under-small-load-tp21315520p21318246.html
Sent from the RabbitMQ mailing list archive at Nabble.com.