I have seen those log files of which you speak. They are... large, to say the least.<div><br></div><div>I just checked the app and SASL logs and there's nothing from that time frame. On the host which had the queue in question, the best I have is 78k SASL messages over the course of the day that look like the one I pasted below.</div>
<div><div><br></div><div>=SUPERVISOR REPORT==== 9-Aug-2011::12:19:16 ===</div><div> Supervisor: {&lt;0.7200.4&gt;,rabbit_channel_sup_sup}</div><div> Context: shutdown_error</div><div> Reason: shutdown</div>
<div> Offender: [{pid,&lt;0.15122.24&gt;},</div><div> {name,channel_sup},</div><div> {mfa,{rabbit_channel_sup,start_link,[]}},</div><div> {restart_type,temporary},</div>
<div> {shutdown,infinity},</div><div> {child_type,supervisor}]</div><div><br></div><div><br></div><div>I can email you logs, but they're really not exciting. It's all just noise as messages were published and consumers had yet to initialize and declare exchanges, queues and bindings.</div>
<div><br></div><div>I assume you have tests for spooling to disk when rabbit is low on memory. Has this been tested in a cluster situation such as I described?</div><div><br></div><div>I checked our configurations, and I wonder if this is what caused the problem:</div>
<div><div>[{rabbit, [{vm_memory_high_watermark, 0}, {cluster_nodes, ['artemis','hermes']}]}].</div></div><div><br></div><div><br></div><div>-Aaron</div><div><br></div><div><br></div><div><br></div><div><br>
</div><br><div class="gmail_quote">On Mon, Aug 15, 2011 at 9:21 AM, Matthew Sackman <span dir="ltr">&lt;<a href="mailto:matthew@rabbitmq.com">matthew@rabbitmq.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Aaron,<br>
<div class="im"><br>
On Fri, Aug 12, 2011 at 02:27:25PM -0400, Aaron Westendorf wrote:<br>
> We've put two new rabbit 2.5 hosts into production at the same time as we<br>
> performed some other major upgrades to our software on 9 August. As<br>
> designed, our messages queued up in rabbit, and we slowly consumed them.<br>
> When the queues were empty, rabbit blew up and took out both hosts in the<br>
> cluster.<br>
<br>
</div>Not quite as designed. Do you have the logs for these machines<br>
from when they blew up that you can send us (maybe off list)?<br>
<div class="im"><br>
> Shortly after the queues drained, rabbit became completely unresponsive. Our<br>
> `watch` process to list_queues and check on the state stopped running. We<br>
> observed heavy swap churn via kswapd[0-9]+. It appeared that rabbit was<br>
> trying to load in all of the data that it had paged to disk. I've attached<br>
> graphs that show when this event occurred, but they're the weekly rollups so<br>
> it's hidden in the noise. The "max" column is the best record of the limits<br>
> that were hit, especially the "committed" value which is significantly<br>
> higher than the 8GB of RAM available.<br>
<br>
</div>To me, this sounds like Rabbit was about to crash. Sadly, Erlang is very<br>
very poor at string handling. When processes in Erlang crash, they have<br>
their last known state, stacktrace, and "last message in" dumped to the<br>
log. This is fantastic from a debugging pov, but Erlang is so awful at<br>
converting all this data to a string to put into the log, that I<br>
regularly see Erlang eat GBs of RAM, eventually running out and then<br>
swapping to death when converting a few MB of state to a string. This<br>
seems to match what happened to you.<br>
<br>
However, Rabbit certainly shouldn't have crashed, and we've had very few<br>
crash reports with 2.5.1. It's possible that the logs contain at least<br>
part of the reason for the crash, and if so, they'd help a lot. But<br>
other than that there's very little for us to go on to try and help out.<br>
<br>
If you have the logs available and if they contain errors then they'd be<br>
of great help. Has Rabbit been behaving itself since or did you<br>
downgrade back to your previously known-good version?<br>
<br>
Best wishes,<br>
<font color="#888888"><br>
Matthew<br>
</font><div><div></div><div class="h5">_______________________________________________<br>
rabbitmq-discuss mailing list<br>
<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a><br>
<a href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Aaron Westendorf<br>Senior Software Engineer<br>Agora Games<br>359 Broadway<br>Troy, NY 12180<br>Phone: 518.268.1000<br><a href="mailto:aaron@agoragames.com">aaron@agoragames.com</a> <br>
<a href="http://www.agoragames.com">www.agoragames.com</a><br><br>
</div>