[rabbitmq-discuss] major failure when clearing large queue backlog

Mon Aug 15 18:29:23 BST 2011

I have seen those log files of which you speak. They are ..... large, to say
the least.

I just checked the app and SASL logs and there's nothing from that time frame.
On the host which had the queue in question, the best I have is 78k SASL
messages over the course of the day that look like the one I pasted below.

=SUPERVISOR REPORT==== 9-Aug-2011::12:19:16 ===
     Supervisor: {<0.7200.4>,rabbit_channel_sup_sup}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{pid,<0.15122.24>},
                  {name,channel_sup},
                  {mfa,{rabbit_channel_sup,start_link,[]}},
                  {restart_type,temporary},
                  {shutdown,infinity},
                  {child_type,supervisor}]

I can email you logs, but they're really not exciting. It's all just noise
as messages were published and consumers had yet to initialize and declare
exchanges, queues and bindings.

I assume you have tests for spooling to disk when rabbit is low on memory.
Has this been tested in a cluster situation such as I described?

I checked our configurations, and I wonder if this is what caused the
problem:
[{rabbit, [{vm_memory_high_watermark, 0},
           {cluster_nodes, ['artemis','hermes']}]}].
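
For comparison, here is a sketch of the same file with the watermark at 0.4, which is the documented default fraction of installed RAM. This is an untested assumption about what a more conventional setting would look like, not a confirmed fix; whether the 0 value actually caused the failure is still an open question. The node names are copied unchanged from the configuration above.

```erlang
%% Sketch only: same structure as the config above, with the memory
%% watermark set to the documented default (0.4 of installed RAM)
%% instead of 0. Not verified against the failing cluster.
[{rabbit, [{vm_memory_high_watermark, 0.4},
           {cluster_nodes, ['artemis','hermes']}]}].
```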

-Aaron

On Mon, Aug 15, 2011 at 9:21 AM, Matthew Sackman <matthew at rabbitmq.com> wrote:

> Hi Aaron,
>
> On Fri, Aug 12, 2011 at 02:27:25PM -0400, Aaron Westendorf wrote:
> > We've put two new rabbit 2.5 hosts into production at the same time as we
> > performed some other major upgrades to our software on 9 August. As
> > designed, our messages queued up in rabbit, and we slowly consumed them.
> > When the queues were empty, rabbit blew up and took out both hosts in the
> > cluster.
>
> Not quite as designed. Do you have the logs for these machines from
> when they blew up that you can send us (maybe off list)?
>
> > Shortly after the queues drained, rabbit became completely unresponsive.
> > Our `watch` process to list_queues and check on the state stopped running.
> > We observed heavy swap churn via kswapd[0-9]+. It appeared that rabbit was
> > trying to load in all of the data that it had paged to disk. I've attached
> > graphs that show when this event occurred, but they're the weekly rollups
> > so it's hidden in the noise. The "max" column is the best record of the
> > limits that were hit, especially the "committed" value which is
> > significantly higher than the 8GB of RAM available.
>
> To me, this sounds like Rabbit was about to crash. Sadly, Erlang is very
> very poor at string handling. When processes in Erlang crash, they have
> their last known state, stacktrace, and "last message in" dumped to the
> log. This is fantastic from a debugging pov, but Erlang is so awful at
> converting all this data to a string to put into the log, that I
> regularly see Erlang eat GBs of RAM, eventually running out and then
> swapping to death when converting a few MB of state to a string. This
> seems to match what happened to you.
>
> However, Rabbit certainly shouldn't have crashed, and we've had very few
> crash reports with 2.5.1. It's possible that the logs contain at least
> part of the reason for the crash, and if so, they'd help a lot. But
> other than that there's very little for us to go on to try and help out.
>
> If you have the logs available and if they contain errors then they'd be
> of great help. Has Rabbit been behaving itself since or did you
> downgrade back to your previously known-good version?
>
> Best wishes,
>
> Matthew

-- 
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
www.agoragames.com