<div dir="ltr"><div>Hi there,</div><div><br></div><div>I&#39;m hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere and started RabbitMQ again.  This got us up and running.  The second time it happened, the cluster contained the original nodes plus an additional 5 nodes that we had added and were planning to leave in place while shutting the old nodes down.</div>


<div><br></div><div>During the crash, the symptoms were as follows:</div><div><br></div><div>- Escalating (and sudden) CPU utilisation on some (but not all) nodes</div><div>- Escalating memory usage (not necessarily aligned with the spiking CPU)</div>


<div>- Increasing time to publish to queues (specifically to a test queue we have set up solely to test publishing to and consuming from the cluster hosts)</div><div>- Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually took up to 10m to return the response data, while others were fast and took 5s)</div>


<div>- Management plugin stops responding, or responds so slowly that it no longer loads any data at all (probably the same thing that causes the preceding item)</div><div>- Can&#39;t force nodes to forget other nodes (calling `rabbitmqctl forget_cluster_node` doesn&#39;t return)</div>


<div><br></div><div>- When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn&#39;t return</div><div>--- When that doesn&#39;t return we eventually have to ctrl-c the command</div>


<div>--- We have to issue a kill signal to rabbit to stop it</div><div>--- Do the same to the epmd process</div><div>--- However, the other nodes all still think that the killed node is active (based on `rabbitmqctl cluster_status`; nodes that were slow to run this and nodes that were fast to run it all saw the same view of the cluster, which included the dead node)</div>


<div><br></div><div><br></div><div>Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored &quot;ha-mode:all&quot;), running on Linux</div><div><br>


</div><div>[</div><div>        {rabbit, [</div><div>                {cluster_nodes, {[&#39;rabbit@b05.internal&#39;, &#39;rabbit@b06.internal&#39;,&#39;rabbit@b07.internal&#39;,&#39;rabbit@b08.internal&#39;,&#39;rabbit@b09.internal&#39;], disc}},</div>


<div>                {cluster_partition_handling, pause_minority}</div><div>        ]}</div><div>]</div><div><br></div><div>And the env:</div><div><br></div><div>NODENAME=&quot;rabbit@b09.internal&quot;</div><div>SERVER_ERL_ARGS=&quot;-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248&quot;</div>


<div><br></div><div>The erl_crash.dump slogan was: eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type &quot;old_heap&quot;).</div><div><br></div><div>System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]</div>


<div><br></div><div>Compiled<span style="white-space:pre-wrap">        </span>Fri Dec 16 03:22:15 2011</div><div>Taints<span style="white-space:pre-wrap">        </span>(none)</div><div>Memory allocated<span style="white-space:pre-wrap">        </span>6821368760 bytes</div>


<div>Atoms<span style="white-space:pre-wrap">        </span>22440</div><div>Processes<span style="white-space:pre-wrap">        </span>4899</div><div>ETS tables<span style="white-space:pre-wrap">        </span>80</div><div>Timers<span style="white-space:pre-wrap">        </span>23</div>


<div>Funs<span style="white-space:pre-wrap">        </span>3994</div><div><br></div><div>When I look at the Process Information, there is a small number of processes with a lot of messages queued, and the rest are at least an order of magnitude lower:</div>


<div><br></div><div>Pid<span style="white-space:pre-wrap">        </span>Name/Spawned as<span style="white-space:pre-wrap">        </span>State<span style="white-space:pre-wrap">        </span>Reductions<span style="white-space:pre-wrap">        </span>Stack+heap<span style="white-space:pre-wrap">        </span> MsgQ Length</div>


<div>&lt;0.400.0&gt;<span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Scheduled<span style="white-space:pre-wrap">        </span>146860259<span style="white-space:pre-wrap">        </span> 59786060<span style="white-space:pre-wrap">        </span>37148</div>


<div>&lt;0.373.0&gt;<span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Scheduled<span style="white-space:pre-wrap">        </span>734287949<span style="white-space:pre-wrap">        </span> 1346269<span style="white-space:pre-wrap">        </span>23360</div>


<div>&lt;0.366.0&gt;<span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Waiting<span style="white-space:pre-wrap">        </span>114695635<span style="white-space:pre-wrap">        </span> 5135590<span style="white-space:pre-wrap">        </span>19744</div>


<div>&lt;0.444.0&gt;<span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Waiting<span style="white-space:pre-wrap">        </span>154538610<span style="white-space:pre-wrap">        </span> 832040<span style="white-space:pre-wrap">        </span> 3326</div>
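<div><br></div><div>A minimal sketch of how a ranking like the one above could be pulled straight from the erl_crash.dump text, without loading it into a viewer. It assumes the dump marks each process with a line starting with =proc: and records the queue size on a line starting with "Message queue length:"; the function name top_message_queues is made up for illustration and is not part of any Erlang or RabbitMQ tool.</div>

```python
import re
import sys


def top_message_queues(lines, limit=10):
    """Return (pid, queue_length) pairs, longest message queue first.

    Assumes each process section starts with "=proc:" followed by the pid
    and contains one "Message queue length: N" line.
    """
    queues = []
    pid = None
    for line in lines:
        if line.startswith("=proc:"):
            # Start of a process section; remember which pid we are in.
            pid = line[len("=proc:"):].strip()
        elif line.startswith("="):
            # Any other section header ends the current process section.
            pid = None
        elif pid is not None:
            match = re.match(r"Message queue length:\s*(\d+)", line)
            if match:
                queues.append((pid, int(match.group(1))))
    queues.sort(key=lambda item: item[1], reverse=True)
    return queues[:limit]


if __name__ == "__main__":
    # Usage (hypothetical script name): python top_queues.py erl_crash.dump
    if len(sys.argv) == 2:
        # Iterate the file line by line, since dumps can be several GB.
        with open(sys.argv[1], errors="replace") as dump:
            for pid, length in top_message_queues(dump):
                print(pid, length)
```

<div>The file is read line by line rather than in one go, because a dump taken at this memory size may itself be very large.</div>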


<div><br></div><div>When I view the second process (viewing the first one crashes Erlang on me), I see a large number of sender_death events (I&#39;m not sure whether these are common or highly unusual):</div><div><br></div><div>{&#39;$gen_cast&#39;,{gm,{sender_death,&lt;2710.20649.64&gt;}}}</div>


<div><br></div><div>mixed in with other more regular events:</div><div><br></div><div>{&#39;$gen_cast&#39;,</div><div>    {gm,{publish,&lt;2708.20321.59&gt;,</div><div>            {message_properties,undefined,false},</div>


<div>            {basic_message,</div><div>&lt;.. snip..&gt;</div></div>