[rabbitmq-discuss] major failure when clearing large queue backlog

Aaron Westendorf aaron at agoragames.com
Fri Aug 12 19:27:25 BST 2011


We put two new rabbit 2.5 hosts into production on 9 August, at the same
time as we performed some other major upgrades to our software. As designed,
our messages queued up in rabbit, and we slowly consumed them. Once the
queues were empty, rabbit blew up and took out both hosts in the cluster.

Hosts:
2 x 12 core AMD Opteron 6172
8GB ram

The situation here is that we had lots of requests, 1-2k/sec, coming in on
one host and routed to queues on the second, to which a pool of consumers
was connected. For a variety of reasons, that pool of consumers was smaller
than we wanted and we had trouble increasing the count. The first incident
occurred after the backlog reported by rabbit reached over 700k messages;
the second peaked below 300k. Message size ranged from 2-20k.
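
For scale, the backlog figures above can be turned into a rough memory
estimate. This is only a back-of-envelope sketch: it assumes "k" in the
message sizes means KiB, it ignores per-message broker overhead, and the
function name is made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3

def backlog_bytes(message_count, message_size_bytes):
    """Payload bytes held by a backlog of equally sized messages."""
    return message_count * message_size_bytes

# Backlog peaks reported above, bracketed by the smallest and largest
# message sizes (assumed to be 2 KiB and 20 KiB).
for label, count in (("first incident", 700_000), ("second incident", 300_000)):
    low = backlog_bytes(count, 2 * KIB)
    high = backlog_bytes(count, 20 * KIB)
    print(f"{label}: {low / GIB:.2f} to {high / GIB:.2f} GiB of payload")
```

Under those assumptions, the upper bound for the first backlog is well
beyond the 8GB of RAM on each host, so paging to disk would be expected.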

In both cases, both rabbit hosts started spooling to disk as free memory
dropped. We do not use qos, and so our consumers also had significant memory
allocated to frames buffered on their input queues. Eventually, we were able
to spawn enough consumers to slowly drain the queues. At the rate messages
were arriving, this took a while, but everything stayed within spec.
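
For reference, the consumer-side buffering mentioned above is what a qos
prefetch limit bounds. The sketch below is not the client code from this
setup; it assumes a channel object with a pika-style
`basic_qos(prefetch_count=...)` method, and both helper names are
hypothetical. A prefetch limit only takes effect for consumers that
acknowledge messages.

```python
def limit_unacked(channel, prefetch_count):
    """Ask the broker to keep at most prefetch_count unacked messages
    in flight to consumers on this channel."""
    if prefetch_count < 1:
        raise ValueError("prefetch_count must be at least 1")
    # Assumption: pika-style signature; other clients name this differently.
    channel.basic_qos(prefetch_count=prefetch_count)

def max_buffered_bytes(prefetch_count, max_message_bytes):
    """Upper bound on payload bytes a consumer can have buffered."""
    return prefetch_count * max_message_bytes
```

With a prefetch of 50 and messages of at most 20 KiB (again assuming "k"
means KiB), `max_buffered_bytes(50, 20 * 1024)` caps each consumer at about
1 MB of buffered payload, rather than whatever the broker is able to push.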

Shortly after the queues drained, rabbit became completely unresponsive. Our
`watch` process, which ran list_queues to check on the state, stopped
running. We observed heavy swap churn via kswapd[0-9]+. It appeared that
rabbit was trying to load in all of the data that it had paged to disk. I've
attached graphs that show when this event occurred, but they're weekly
rollups, so it's hidden in the noise. The "max" column is the best record of
the limits that were hit, especially the "committed" value, which is
significantly higher than the 8GB of RAM available.

-Aaron


-- 
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
www.agoragames.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: load-week.png
Type: image/png
Size: 16202 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20110812/5cff119f/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory-week.png
Type: image/png
Size: 29857 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20110812/5cff119f/attachment-0001.png>

