<div dir="ltr">On 17 October 2013 01:29, Tim Watson <span dir="ltr">&lt;<a href="mailto:tim@rabbitmq.com" target="_blank">tim@rabbitmq.com</a>&gt;</span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Hello David!<br>
<div class="im"><br></div></blockquote><div><br></div><div>Hey Tim, thanks for replying so quickly!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div class="im">
On 16 Oct 2013, at 15:14, David Harrison wrote:<br>
&gt; Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again.  This got us up and running.  Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.<br>

&gt;<br>
<br>
</div>What version of rabbit are you running, and how was it installed?<br></blockquote><div><br></div><div>3.1.5, running on Ubuntu Precise, installed via deb package.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; During the crash symptoms were as follows:<br>
&gt;<br>
&gt; - Escalating (and sudden) CPU utilisation on some (but not all) nodes<br>
<br>
</div>We&#39;ve fixed at least one bug with that symptom in recent releases.<br></blockquote><div><br></div><div>I think 3.1.5 is the latest stable release?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; - Increasing time to publish on queues (and specifically on a test queue we have setup that exists only to test publishing and consuming from the cluster hosts)<br>
<br>
</div>Are there multiple publishers on the same connection/channel when this happens? It wouldn&#39;t be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.<br></blockquote>
<div><br></div><div>Yes, in some cases there would be. For our test queue there wouldn&#39;t be, though we still saw publish times of up to 10s on it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; - Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually taking up to 10m to return with the response data - some were fast and took 5s)<br>
<br>
</div>Wow, 10m is amazingly slow. Can you provide log files for this period of activity and problems?<br></blockquote><div><br></div><div>I&#39;ll take a look, we saw a few &quot;too many processes&quot; messages,</div><div>
<br></div><div>&quot;Generic server net_kernel terminating&quot; followed by:</div><div><span class=""><br></span></div><div><span class=""><div>
** Reason for termination ==</div><div>** {system_limit,[{erlang,spawn_opt,</div><div>[inet_tcp_dist,do_setup,</div><div>[&lt;0.19.0&gt;,&#39;rabbit@b02.internal&#39;,normal,</div><div>&#39;rabbit@b00.internal&#39;,longnames,7000],</div>
<div>[link,{priority,max}]]},</div><div>{net_kernel,setup,4},</div><div>{net_kernel,handle_call,3},</div><div>{gen_server,handle_msg,5},</div><div>{proc_lib,init_p_do_apply,3}]}</div></span></div><div> </div><div><pre><div id="LC402" class="">
=ERROR REPORT==== 15-Oct-2013::16:07:10 ===</div>
<div id="LC403" class="">** gen_event handler rabbit_error_logger crashed.</div><div id="LC404" class="">
** Was installed in error_logger</div><div id="LC405" class="">** Last event was: {error,&lt;0.8.0&gt;,{emulator,&quot;~s~n&quot;,[&quot;Too many processes\n&quot;]}}</div>
<div id="LC406" class="">** When handler state == {resource,&lt;&lt;&quot;/&quot;&gt;&gt;,exchange,&lt;&lt;&quot;amq.rabbitmq.log&quot;&gt;&gt;}</div>
<div id="LC407" class="">** Reason == {aborted,</div><div id="LC408" class="">                 {no_exists,</div>
<div id="LC409" class="">                     [rabbit_topic_trie_edge,</div><div id="LC410" class="">                      {trie_edge,</div>
<div id="LC411" class="">                          {resource,&lt;&lt;&quot;/&quot;&gt;&gt;,exchange,&lt;&lt;&quot;amq.rabbitmq.log&quot;&gt;&gt;},</div>
<div id="LC412" class="">                          root,&quot;error&quot;}]}}</div><div id="LC412" class=""><span class=""><br></span></div>
<div id="LC412" class=""><span class=""><br></span></div><div id="LC412" class=""><span class="">=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
Mnesia(nonode@nohost): ** ERROR ** mnesia_controller got unexpected info: {&#39;EXIT&#39;,
&lt;0.97.0&gt;,
shutdown}<br></span></div><div id="LC412" class=""><span class=""><br></span></div><div id="LC412" class=""><span class="">=ERROR REPORT==== 15-Oct-2013::16:11:38 ===
Mnesia(&#39;rabbit@b00.internal&#39;): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, &#39;rabbit@b01.internal&#39;}<br></span></div></pre></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn&#39;t return<br>
<br>
</div>Again, we&#39;ve fixed bugs in that area in recent releases.<br>
<div class="im"><br>
&gt; --- When that doesn&#39;t return we eventually have to ctrl-c the command<br>
&gt; --- We have to issue a kill signal to rabbit to stop it<br>
&gt; --- Do the same to the epmd process<br>
<br>
</div>Even if you have to `kill -9` a rabbit node, you shouldn&#39;t need to kill epmd. In theory at least. If that was necessary to fix the &quot;state of the world&quot;, it would be indicative of a problem related to the Erlang distribution mechanism, but I very much doubt that&#39;s the case here.<br>
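<div>A hung `rabbitmqctl` call like the one described above can be bounded with a timeout instead of being interrupted by hand. The following is a minimal sketch, not something from this thread: `run_bounded` is a hypothetical helper, and the command it runs here is a stand-in so the example works on any machine. Note that it only bounds the CLI call; it does not stop the broker itself.</div>
<div><pre>
```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd and return (finished, stdout); give up after timeout_s seconds."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so the hung command is not left behind.
        return False, ""
    return True, done.stdout


# Stand-in command so the sketch runs anywhere. On a broker host this would be
# something like ["rabbitmqctl", "stop_app"] (assumption: rabbitmqctl is on PATH).
finished, output = run_bounded([sys.executable, "-c", "print('ok')"], timeout_s=30)
print(finished, output.strip())
```
</pre></div>
<div>If `finished` comes back false, that is the point at which falling back to a kill signal for the rabbit node would be considered.</div>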

<div class="im"><br>
&gt; Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored &quot;ha-mode:all&quot;), running on Linux<br>
&gt;<br>
<br>
</div>How many queues are we talking about here?<br></blockquote><div><br></div><div>~30</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; [<br>
&gt;         {rabbit, [<br>
&gt;                 {cluster_nodes, {[&#39;rabbit@b05.internal&#39;, &#39;rabbit@b06.internal&#39;,&#39;rabbit@b07.internal&#39;,&#39;rabbit@b08.internal&#39;,&#39;rabbit@b09.internal&#39;], disc}},<br>
&gt;                 {cluster_partition_handling, pause_minority}<br>
<br>
</div>Are you sure that what you&#39;re seeing is not caused by a network partition? If it were, any nodes in a minority island would &quot;pause&quot;, which would certainly lead to the kind of symptoms you&#39;ve mentioned here, viz. rabbitmqctl calls not returning and so on.<br>
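<div>The pause_minority rule mentioned above can be sketched as follows. This is an illustration under stated assumptions, not RabbitMQ&#39;s implementation: it assumes a node pauses whenever its island holds half or fewer of the cluster&#39;s nodes, and it assumes a ten-node cluster with short hypothetical node names (five original nodes plus the five added ones).</div>
<div><pre>
```python
def paused_nodes(islands):
    """Return the nodes that pause_minority would pause for a given partition.

    `islands` is a list of lists of node names, one inner list per group of
    nodes that can still see each other.
    """
    total = sum(len(island) for island in islands)
    paused = []
    for island in islands:
        # An island keeps running only if it holds a strict majority,
        # i.e. more than half of all cluster nodes.
        has_majority = 2 * len(island) > total
        if not has_majority:
            paused.extend(island)
    return sorted(paused)


# Hypothetical short names: five original nodes and five added nodes.
old = ["b00", "b01", "b02", "b03", "b04"]
new = ["b05", "b06", "b07", "b08", "b09"]

# A 7/3 split pauses only the three-node island.
print(paused_nodes([old + new[:2], new[2:]]))
# An even 5/5 split leaves no island with a majority, so every node pauses.
print(paused_nodes([old, new]))
```
</pre></div>
<div>The second case is worth noting for a cluster that temporarily holds an even number of nodes: an even split would pause the whole cluster, not just part of it.</div>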
</blockquote><div><br></div><div>There was definitely a network partition, but the whole cluster nosedived during the crash.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type &quot;old_heap&quot;).<br>
&gt;<br>
<br>
</div>That&#39;s a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slowness and unresponsiveness.<br></blockquote><div><br></div>
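<div>To put rough numbers on &quot;well before that happens&quot;, here is a small sketch of the memory thresholds involved. It assumes the default settings as I understand them (vm_memory_high_watermark of 0.4 and vm_memory_high_watermark_paging_ratio of 0.5); the host RAM size is hypothetical and not taken from this thread.</div>
<div><pre>
```python
def memory_thresholds(total_ram_bytes, high_watermark=0.4, paging_ratio=0.5):
    """Return (paging_start, publisher_block) thresholds in bytes.

    publisher_block: memory use at which the memory alarm blocks publishers.
    paging_start: memory use at which queues begin paging messages to disk.
    """
    publisher_block = int(total_ram_bytes * high_watermark)
    paging_start = int(publisher_block * paging_ratio)
    return paging_start, publisher_block


GIB = 1024 ** 3
ram = 8 * GIB  # hypothetical host size, not taken from the thread
paging_start, publisher_block = memory_thresholds(ram)
failed_alloc = 2850821240  # from the erl_crash.dump slogan quoted in this thread

print("paging starts at     ", paging_start, "bytes")
print("publishers blocked at", publisher_block, "bytes")
print("failed allocation    ", failed_alloc, "bytes")
```
</pre></div>
<div>Comparing the failed allocation with these thresholds for the real host size gives a quick sense of how far past the intended limits the node had grown.</div>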
<div>These hosts aren&#39;t running swap; we give them a fair bit of RAM (and have now given them even more as a possible stopgap).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="im"><br>
&gt; System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]<br>
&gt;<br>
<br>
</div>I&#39;d strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.<br>
<div class="im"><br></div></blockquote><div><br></div><div>OK, good advice, we&#39;ll do that.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div class="im">
&gt; When I look at the Process Information, it seems a small number of processes have A LOT of messages queued, and the rest are an order of magnitude lower:<br>
&gt;<br>
<br>
</div>That&#39;s not unusual.<br>
<div class="im"><br>
&gt; When I view the second process (the first one crashes Erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?)<br>
&gt;<br>
&gt; {&#39;$gen_cast&#39;,{gm,{sender_death,&lt;2710.20649.64&gt;}}}<br>
&gt;<br>
<br>
</div>Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.<br>
<div class="im"><br>
&gt; mixed in with other more regular events:<br>
&gt;<br>
<br>
</div>Actually, sender_death messages are not &quot;irregular&quot; as such. They&#39;re just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.<br>

<br>
Cheers,<br>
Tim<br>
<br>
<br>
</blockquote></div><br></div></div>