<div dir="ltr"><div>Hi there,</div><div><br></div><div>Hoping someone can help me out. We recently experienced two crashes with our RabbitMQ cluster. After the first crash, we moved the Mnesia directories elsewhere and started RabbitMQ again, which got us up and running. The second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster, which we were planning to leave in place while shutting the old nodes down.</div>
<div><br></div><div>During the crash the symptoms were as follows:</div><div><br></div><div>- Escalating (and sudden) CPU utilisation on some (but not all) nodes</div><div>- Escalating memory usage (not necessarily aligned with the spiking CPU)</div>
<div>- Increasing time to publish on queues (and specifically on a test queue we have set up that exists only to test publishing and consuming from the cluster hosts)</div><div>- Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually took up to 10 minutes to return the response data, while others were fast and took 5s)</div>
<div>- Management plugin stops responding, or responds so slowly it's no longer loading any data at all (probably the same thing that causes the preceding item)</div><div>- Can't force nodes to forget other nodes (calling `rabbitmqctl forget_cluster_node` doesn't return)</div>
<div><br></div><div>- When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn't return</div><div>--- When that doesn't return we eventually have to ctrl-c the command</div>
<div>--- We have to issue a kill signal to rabbit to stop it</div><div>--- Do the same to the epmd process</div><div>--- However, the other nodes all still think the killed node is active (based on `rabbitmqctl cluster_status` -- both the nodes that were slow to run it and those that were fast saw the same view of the cluster, which included the dead node)</div>
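For reference, the forced shutdown sequence above can be sketched as follows. This is a hedged sketch, not our exact commands: the node name is one of this cluster's, but the process-matching pattern and available tooling may differ by distro and RabbitMQ version.

```shell
# On the host of the hung node, after `rabbitmqctl stop_app` blocks:
pkill -9 -f 'beam.*rabbit@b09.internal'   # kill the Erlang VM running this node
epmd -kill                                # kill the port mapper as well

# From a surviving node, remove the dead member from the cluster metadata:
rabbitmqctl forget_cluster_node rabbit@b09.internal
```

In our case the last step is exactly what hangs, which is why the survivors keep reporting the dead node as active.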
<div><br></div><div><br></div><div>Config / details are as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that mirrors all queues, "ha-mode:all"), running on Linux</div><div><br>
</div><div>[</div><div>    {rabbit, [</div><div>        {cluster_nodes, {['rabbit@b05.internal', 'rabbit@b06.internal','rabbit@b07.internal','rabbit@b08.internal','rabbit@b09.internal'], disc}},</div>
<div>� � � � � � � � {cluster_partition_handling, pause_minority}</div><div>� � � � ]}</div><div>]</div><div><br></div><div>And the env:</div><div><br></div><div>NODENAME="rabbit@b09.internal"</div><div>SERVER_ERL_ARGS="-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248"</div>
<div><br></div><div>The erl_crash.dump slogan was: eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").</div><div><br></div><div>System version: Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]</div>
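For scale, converting the two figures in the dump to MiB (just arithmetic, nothing RabbitMQ-specific) shows the failed old_heap request was roughly another 2.7 GiB on top of the ~6.5 GiB the VM had already allocated:

```shell
# Figures taken verbatim from the crash dump above.
failed_alloc=2850821240   # bytes the "old_heap" allocation requested
total_alloc=6821368760    # "Memory allocated" at crash time

echo "$(( failed_alloc / 1048576 )) MiB requested"          # prints "2718 MiB requested"
echo "$(( total_alloc / 1048576 )) MiB already allocated"   # prints "6505 MiB already allocated"
```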
<div><br></div><div>Compiled<span style="white-space:pre-wrap">        </span>Fri Dec 16 03:22:15 2011</div><div>Taints<span style="white-space:pre-wrap">        </span>(none)</div><div>Memory allocated<span style="white-space:pre-wrap">        </span>6821368760 bytes</div>
<div>Atoms<span style="white-space:pre-wrap">        </span>22440</div><div>Processes<span style="white-space:pre-wrap">        </span>4899</div><div>ETS tables<span style="white-space:pre-wrap">        </span>80</div><div>Timers<span style="white-space:pre-wrap">        </span>23</div>
<div>Funs<span style="white-space:pre-wrap">        </span>3994</div><div><br></div><div>When I look at the Process Information it seems there's a small number of processes with a lot of messages queued, while the rest are an order of magnitude lower:</div>
<div><br></div><div>Pid<span style="white-space:pre-wrap">        </span>Name/Spawned as<span style="white-space:pre-wrap">        </span>State<span style="white-space:pre-wrap">        </span>Reductions<span style="white-space:pre-wrap">        </span>Stack+heap<span style="white-space:pre-wrap">        </span> MsgQ Length</div>
<div><0.400.0><span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Scheduled<span style="white-space:pre-wrap">        </span>146860259<span style="white-space:pre-wrap">        </span> 59786060<span style="white-space:pre-wrap">        </span>37148</div>
<div><0.373.0><span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Scheduled<span style="white-space:pre-wrap">        </span>734287949<span style="white-space:pre-wrap">        </span> 1346269<span style="white-space:pre-wrap">        </span>23360</div>
<div><0.366.0><span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Waiting<span style="white-space:pre-wrap">        </span>114695635<span style="white-space:pre-wrap">        </span> 5135590<span style="white-space:pre-wrap">        </span>19744</div>
<div><0.444.0><span style="white-space:pre-wrap">        </span>proc_lib:init_p/5<span style="white-space:pre-wrap">        </span>Waiting<span style="white-space:pre-wrap">        </span>154538610<span style="white-space:pre-wrap">        </span> 832040<span style="white-space:pre-wrap">        </span> 3326</div>
<div><br></div><div>When I view the second process (the first one crashes Erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?)</div><div><br></div><div>{'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}</div>
<div><br></div><div>mixed in with other more regular events:</div><div><br></div><div>{'$gen_cast',</div><div>    {gm,{publish,<2708.20321.59>,</div><div>            {message_properties,undefined,false},</div>
<div>            {basic_message,</div><div><.. snip..></div></div>