Hi, Simone...<br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>I am relatively new to Rabbitmq and would appreciate help in troubleshooting a recurring issue on a cluster, apologies for the long email.</div>
</blockquote><div><br></div><div>No problem!  Details are usually helpful...</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>I run a 3 instances cluster in ec2, 2 disk nodes (A and B) 1 ram node (C), exchanges and queues are static and limited in number (less than 50), the volume of messages can reach a few thousands per second and the queues can occasionally grow up to a few hundred thousands until the processes manage to catch up, but this is well within our memory/disk high watermark. Rabbitmq is v. 2.8.4 on ubuntu 12.</div>
</blockquote><div><br></div><div>2.8.4 is getting a bit old now, and it came from the middle of a sequence of bug-fix releases during which many things improved, so you might want to consider upgrading.  Before you do, see the remarks below.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div>I would like to better understand the crash report and perhaps have some ideas on what went wrong and how to more effectively troubleshoot issues (what more info should I collect before restarting the nodes, erlang processes list, mnesia tables, ets tables etc).</div>
<div><br></div><div><div>=CRASH REPORT==== 14-Feb-2013::10:36:54 ===</div><div>  crasher:</div><div>    initial call: rabbit_reader:init/4</div><div><span>    pid: &lt;</span><b>0.29283.387</b><span>&gt;</span></div>

<div>    registered_name: []</div><div>    exception error: bad argument</div><div>      in function  port_close/1</div><div>         called as port_close(#Port&lt;0.746540&gt;)</div><div>      in call from rabbit_net:maybe_fast_close/1</div>


<div>      in call from rabbit_reader:start_connection/7</div><div>    ancestors: [&lt;0.29280.387&gt;,rabbit_tcp_client_sup,rabbit_sup,&lt;0.161.0&gt;]</div><div>    messages: []</div><div>    links: [&lt;0.29280.387&gt;]</div>


<div>    dictionary: [{{channel,10},</div><div>                   {&lt;0.29364.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                  {{ch_pid,&lt;0.29338.387&gt;},{7,#Ref&lt;0.0.2158.60093&gt;}},</div>


<div>                  {{ch_pid,&lt;0.29333.387&gt;},{6,#Ref&lt;0.0.2158.60085&gt;}},</div><div>                  {{ch_pid,&lt;0.29325.387&gt;},{5,#Ref&lt;0.0.2158.60053&gt;}},</div><div>                  {{channel,3},</div>


<div>                   {&lt;0.29313.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                  {{ch_pid,&lt;0.29305.387&gt;},{2,#Ref&lt;0.0.2158.60002&gt;}},</div><div>                  {{channel,4},</div>


<div>                   {&lt;0.29321.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                  {{channel,11},</div><div>                   {&lt;0.29370.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>


                  {{ch_pid,&lt;0.29313.387&gt;},{3,#Ref&lt;0.0.2158.60017&gt;}},</div><div>                  {{ch_pid,&lt;0.29299.387&gt;},{1,#Ref&lt;0.0.2158.59976&gt;}},</div><div>                  {{ch_pid,&lt;0.29346.387&gt;},{8,#Ref&lt;0.0.2158.60112&gt;}},</div>


<div>                  {{ch_pid,&lt;0.29370.387&gt;},{11,#Ref&lt;0.0.2158.60189&gt;}},</div><div>                  {{channel,7},</div><div>                   {&lt;0.29338.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div>


<div>                  {{channel,9},</div><div>                   {&lt;0.29356.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                  {{ch_pid,&lt;0.29321.387&gt;},{4,#Ref&lt;0.0.2158.60034&gt;}},</div>


<div>                  {{ch_pid,&lt;0.29364.387&gt;},{10,#Ref&lt;0.0.2158.60166&gt;}},</div><div>                  {{ch_pid,&lt;0.29356.387&gt;},{9,#Ref&lt;0.0.2158.60140&gt;}},</div><div>                  {{channel,8},</div>


<div>                   {&lt;0.29346.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                  {{channel,5},</div><div>                   {&lt;0.29325.387&gt;,{method,rabbit_framing_amqp_0_9_1}}},</div><div>


                  {{channel,1},</div><div>                   {&lt;0.29299.387&gt;,</div><div>                    {content_body,</div><div>                        {&#39;basic.publish&#39;,0,&lt;&lt;&quot;some_exchange&quot;&gt;&gt;,&lt;&lt;&gt;&gt;,false,</div>


<div>                            false},</div><div>                        1048189,</div><div>                        {content,60,none,</div><div>                            &lt;&lt;BYTES IN HERE&gt;&gt;,   --&gt; this showed which process was sending the message</div>


<div>                            rabbit_framing_amqp_0_9_1,</div><div>                            [&lt;&lt;MORE BYTES IN HERE&gt;&gt;]  --&gt; This I haven&#39;t been able to decode, it is fairly big, is it truncated?</div>
</div></blockquote><div><br></div><div>Unfortunately, nothing springs to mind immediately to pursue from this...</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>

</div><div>And in the logs we can find the pid <b>0.29283.387 </b>right before the crash:</div><div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:31:46 ===</div><div><span>accepting AMQP connection &lt;<b>0.29283.387</b>&gt; (10.xx.xx.xx:58622 -&gt; 10.</span>xx.xx.xx<span>:5672)</span></div>


<div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:31:46 ===</div><div><span>accepting AMQP connection &lt;0.29287.387&gt; (10.</span>xx.xx.xx<span>:58623 -&gt; 10.</span>xx.xx.xx<span>:5672)</span></div><div><br></div>


<div>=WARNING REPORT==== 14-Feb-2013::10:32:27 ===</div><div><span>closing AMQP connection &lt;0.27107.387&gt; (10.</span>xx.xx.xx<span>:50882 -&gt; 10.</span>xx.xx.xx<span>:5672):</span></div><div>connection_closed_abruptly</div>
</blockquote><div><br></div><div>This could be pretty much anything, from client misbehavior to connection disruption...</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>Looking at the rabbitmqctl report I have not been able to map the memory consumption to something specific yet.<br></div></blockquote><div><br></div><div>I&#39;d proceed with an upgrade to a more recent Rabbit...  check here first:</div>
<div><br></div><div><a href="http://www.rabbitmq.com/blog/2012/11/19/breaking-things-with-rabbitmq-3-0/">http://www.rabbitmq.com/blog/2012/11/19/breaking-things-with-rabbitmq-3-0/</a></div><div><br></div><div>If none of the changes in 3.0.x will cause you short-term inconvenience, go straight to the latest release, 3.0.2. If there are 3.0 changes that you think will bother you or require changes to your apps or infrastructure, jump to 2.8.7 for now. It was the seventh and last of the 2.8.x series and contains a pile of incremental fixes that may help with this problem, which is otherwise tricky to diagnose from what we have available right now.</div>
<div><br></div><div>Best regards,</div><div>Jerry</div><div><br></div></div>