[rabbitmq-discuss] major failure when clearing large queue backlog
matthew at rabbitmq.com
Mon Aug 15 14:21:14 BST 2011
On Fri, Aug 12, 2011 at 02:27:25PM -0400, Aaron Westendorf wrote:
> We've put two new rabbit 2.5 hosts into production at the same time as we
> performed some other major upgrades to our software on 9 August. As
> designed, our messages queued up in rabbit, and we slowly consumed them.
> When the queues were empty, rabbit blew up and took out both hosts in the
Not quite as designed. Do you have the logs from these machines covering
the time they blew up? If so, could you send them to us (off list if you
prefer)?
> Shortly after the queues drained, rabbit became completely unresponsive. Our
> `watch` process to list_queues and check on the state stopped running. We
> observed heavy swap churn via kswapd[0-9]+. It appeared that rabbit was
> trying to load in all of the data that it had paged to disk. I've attached
> graphs that show when this event occurred, but they're the weekly rollups so
> it's hidden in the noise. The "max" column is the best record of the limits
> that were hit, especially the "committed" value which is significantly
> higher than the 8GB of RAM available.
To me, this sounds like Rabbit was about to crash. Sadly, Erlang is very
poor at string handling. When an Erlang process crashes, its last known
state, stacktrace, and "last message in" are dumped to the log. This is
fantastic from a debugging point of view, but Erlang is so inefficient at
converting all this data to a string for the log that I regularly see it
eat GBs of RAM, eventually running out and swapping to death, when
converting just a few MB of state. This seems to match what happened to
you.
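As a rough illustration of why a few MB of state can balloon like this:
a list-based Erlang string costs about two machine words per character,
so 16 bytes per character on a 64-bit VM. The Python sketch below is
back-of-the-envelope arithmetic only. The function names, the expansion
factor (chars_per_byte) and the number of intermediate copies
(live_copies) are assumptions chosen for illustration, not measurements
from Rabbit or from this incident.

```python
# Rough estimate of the heap needed to hold a formatted crash report
# as an Erlang string (a linked list of character codes).

WORD_BYTES = 8       # word size on a 64-bit Erlang VM
WORDS_PER_CHAR = 2   # one cons cell (two words) per character


def erlang_string_bytes(num_chars: int) -> int:
    """Heap bytes for a list-based string of num_chars characters."""
    return num_chars * WORDS_PER_CHAR * WORD_BYTES


def estimate_report_bytes(state_bytes: int,
                          chars_per_byte: float,
                          live_copies: int) -> int:
    """Estimate peak heap use while formatting state_bytes of state.

    chars_per_byte: printed characters per byte of state (a guess).
    live_copies: intermediate strings alive at the same time (a guess).
    """
    num_chars = int(state_bytes * chars_per_byte)
    return erlang_string_bytes(num_chars) * live_copies


if __name__ == "__main__":
    state = 5 * 1024 * 1024  # "a few MB" of state, assumed to be 5 MiB
    peak = estimate_report_bytes(state, chars_per_byte=4.0, live_copies=3)
    print(f"estimated peak: {peak / 2**30:.2f} GiB")
```

Even with these modest guesses the estimate lands near 1 GiB for 5 MiB
of state, and larger state or more intermediate copies push it higher.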
However, Rabbit certainly shouldn't have crashed, and we've had very few
crash reports with 2.5.1. If the logs are available and contain errors,
they may show at least part of the reason for the crash, which would
help a lot. Without them there's very little for us to go on. Has Rabbit
been behaving itself since, or did you downgrade to your previous
known-good version?