<div dir="ltr">Quick update on the queue count: 56<div class="gmail_extra"><br><div class="gmail_quote">On 17 October 2013 02:29, David Harrison <span dir="ltr">&lt;<a href="mailto:dave.l.harrison@gmail.com" target="_blank">dave.l.harrison@gmail.com</a>&gt;</span> wrote:<br>
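(For reference, one way to get a count like the one above — illustrative, assumes rabbitmqctl on PATH and the default vhost:)

```shell
# -q suppresses the "Listing queues ..." header so wc -l counts only queues
rabbitmqctl -q list_queues name | wc -l
```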
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">On 17 October 2013 01:29, Tim Watson <span dir="ltr">&lt;<a href="mailto:tim@rabbitmq.com" target="_blank">tim@rabbitmq.com</a>&gt;</span> wrote:<br>
<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Hello David!<br>
<div><br></div></blockquote><div><br></div><div>Hey Tim, thanks for replying so quickly!</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div>
On 16 Oct 2013, at 15:14, David Harrison wrote:<br>
&gt; Hoping someone can help me out.  We recently experienced 2 crashes with our RabbitMQ cluster.  After the first crash, we moved the Mnesia directories elsewhere, and started RabbitMQ again.  This got us up and running.  Second time it happened, we had the original nodes plus an additional 5 nodes we had added to the cluster that we were planning to leave in place while shutting the old nodes down.<br>


&gt;<br>
<br>
</div>What version of rabbit are you running, and how was it installed?<br></blockquote><div><br></div></div><div>3.1.5, running on Ubuntu Precise, installed via deb package.</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div><br>
&gt; During the crash symptoms were as follows:<br>
&gt;<br>
&gt; - Escalating (and sudden) CPU utilisation on some (but not all) nodes<br>
<br>
</div>We&#39;ve fixed at least one bug with that symptom in recent releases.<br></blockquote><div><br></div></div><div>I think 3.1.5 is the latest stable release?</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div><br>
&gt; - Increasing time to publish on queues (and specifically on a test queue we have setup that exists only to test publishing and consuming from the cluster hosts)<br>
<br>
</div>Are there multiple publishers on the same connection/channel when this happens? It wouldn&#39;t be unusual, if the server was struggling, to see flow control kick in and affect publishers in this fashion.<br></blockquote>
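One cheap way to put numbers on that from a canary publisher is to wrap the publish call with a timer. A minimal sketch (the `publish` callable is a stand-in for whatever client call is in use, e.g. a pika `basic_publish` closure; names are hypothetical):

```python
import time

def timed_publish(publish, payload, warn_threshold_s=1.0):
    """Time a single publish; `publish` is any callable taking the payload."""
    start = time.monotonic()
    publish(payload)
    elapsed = time.monotonic() - start
    if elapsed > warn_threshold_s:
        # Sustained times above the threshold suggest flow control (or worse)
        print("slow publish: %.2fs" % elapsed)
    return elapsed
```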

<div><br></div></div><div>Yes, in some cases there would be; for our test queue there wouldn&#39;t be -- and we still saw publish times of up to 10s on the test queue.</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div><br>
&gt; - Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes eventually took up to 10m to return the response data; others were fast, returning in 5s)<br>
<br>
</div>Wow, 10m is amazingly slow. Can you provide log files for this period of activity and problems?<br></blockquote><div><br></div></div><div>I&#39;ll take a look; we saw a few &quot;too many processes&quot; messages,</div>
<div><br></div><div>&quot;Generic server net_kernel terminating&quot; followed by:</div><div><br></div><div><pre>
** Reason for termination ==
** {system_limit,[{erlang,spawn_opt,
       [inet_tcp_dist,do_setup,
        [&lt;0.19.0&gt;,&#39;rabbit@b02.internal&#39;,normal,
         &#39;rabbit@b00.internal&#39;,longnames,7000],
       [link,{priority,max}]]},
   {net_kernel,setup,4},
   {net_kernel,handle_call,3},
   {gen_server,handle_msg,5},
   {proc_lib,init_p_do_apply,3}]}

=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
** gen_event handler rabbit_error_logger crashed.
** Was installed in error_logger
** Last event was: {error,&lt;0.8.0&gt;,{emulator,&quot;~s~n&quot;,[&quot;Too many processes\n&quot;]}}
** When handler state == {resource,&lt;&lt;&quot;/&quot;&gt;&gt;,exchange,&lt;&lt;&quot;amq.rabbitmq.log&quot;&gt;&gt;}
** Reason == {aborted,
                 {no_exists,
                     [rabbit_topic_trie_edge,
                      {trie_edge,
                          {resource,&lt;&lt;&quot;/&quot;&gt;&gt;,exchange,&lt;&lt;&quot;amq.rabbitmq.log&quot;&gt;&gt;},
                          root,&quot;error&quot;}]}}

=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
Mnesia(nonode@nohost): ** ERROR ** mnesia_controller got unexpected info: {&#39;EXIT&#39;,
                                                                           &lt;0.97.0&gt;,
                                                                           shutdown}

=ERROR REPORT==== 15-Oct-2013::16:11:38 ===
Mnesia(&#39;rabbit@b00.internal&#39;): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, &#39;rabbit@b01.internal&#39;}
</pre></div><div class="im"><div><br></div><div><br>
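For reference, that system_limit means the Erlang VM refused to spawn another process (the R14 default limit is 32768). Assuming the Debian packaging reads /etc/rabbitmq/rabbitmq-env.conf, the limit can be raised with the emulator's +P flag -- a sketch to adapt rather than drop in, since setting RABBITMQ_SERVER_ERL_ARGS wholesale replaces rabbit's default emulator arguments:

```shell
# /etc/rabbitmq/rabbitmq-env.conf (illustrative)
# +P raises the max Erlang process count above the R14 default of 32768
RABBITMQ_SERVER_ERL_ARGS="+P 1048576"
```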
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div><br>
&gt; - When trying to shut down a node, running `rabbitmqctl stop_app` appears to block on epmd and doesn&#39;t return<br>
<br>
</div>Again, we&#39;ve fixed bugs in that area in recent releases.<br>
<div><br>
&gt; --- When that doesn&#39;t return we eventually have to ctrl-c the command<br>
&gt; --- We have to issue a kill signal to rabbit to stop it<br>
&gt; --- Do the same to the epmd process<br>
<br>
</div>Even if you have to `kill -9&#39; a rabbit node, you shouldn&#39;t need to kill epmd. In theory at least. If that was necessary to fix the &quot;state of the world&quot;, it would be indicative of a problem related to the erlang distribution mechanism, but I very much doubt that&#39;s the case here.<br>


<div><br>
&gt; Config / details as follows (we use mirrored queues -- 5 hosts, all disc nodes, with a global policy that all queues are mirrored &quot;ha-mode:all&quot;), running on Linux<br>
&gt;<br>
<br>
</div>How many queues are we talking about here?<br></blockquote><div><br></div></div><div>~30</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
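For context, the global &quot;all queues mirrored&quot; policy mentioned above would typically have been installed with something like the following (illustrative; RabbitMQ 3.x policy syntax, policy name arbitrary):

```shell
# Mirror every queue in the default vhost across all cluster nodes
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'
```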


<div><br>
&gt; [<br>
&gt;         {rabbit, [<br>
&gt;                 {cluster_nodes, {[&#39;rabbit@b05.internal&#39;, &#39;rabbit@b06.internal&#39;,&#39;rabbit@b07.internal&#39;,&#39;rabbit@b08.internal&#39;,&#39;rabbit@b09.internal&#39;], disc}},<br>
&gt;                 {cluster_partition_handling, pause_minority}<br>
<br>
</div>Are you sure that what you&#39;re seeing is not caused by a network partition? If it were, any nodes in a minority island would &quot;pause&quot;, which would certainly lead to the kind of symptoms you&#39;ve mentioned here, viz rabbitmqctl calls not returning and so on.<br>

</blockquote><div><br></div></div><div>There was definitely a network partition, but the whole cluster nosedived during the crash.</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
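(Since 3.1, partitions Mnesia has observed are reported by cluster_status; a quick illustrative check on each node -- a non-empty `{partitions,[...]}` entry means that node saw a partition:)

```shell
# Inspect the partitions item in the cluster status output
rabbitmqctl cluster_status | grep -A1 partitions
```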


<div><br>
&gt; The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type &quot;old_heap&quot;).<br>
&gt;<br>
<br>
</div>That&#39;s a plain old OOM failure. Rabbit ought to start deliberately paging messages to disk well before that happens, which might also explain a lot of the slowness and unresponsiveness.<br></blockquote><div><br></div>
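The paging behaviour described above is governed by two rabbitmq.config settings; a sketch showing the defaults (values illustrative -- paging starts at paging_ratio times the watermark, i.e. 20% of RAM here):

```erlang
%% rabbitmq.config fragment (defaults shown; tune to taste)
[{rabbit, [
    {vm_memory_high_watermark, 0.4},              %% block publishers at 40% of RAM
    {vm_memory_high_watermark_paging_ratio, 0.5}  %% start paging at 0.5 * 0.4 = 20% of RAM
]}].
```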

</div><div>These hosts aren&#39;t running swap; we give them a fair bit of RAM (and have now given them even more as a possible stopgap).</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div><br>
&gt; System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]<br>
&gt;<br>
<br>
</div>I&#39;d strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient and a *lot* of bug fixes have appeared in erts + OTP since then.<br>
<div><br></div></blockquote><div><br></div></div><div>ok good advice, we&#39;ll do that</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div>
&gt; When I look at the Process Information it seems there&#39;s a small number with a LOT of messages queued, and the rest are an order of magnitude lower:<br>
&gt;<br>
<br>
</div>That&#39;s not unusual.<br>
<div><br>
&gt; when I view the second process (first one crashes Erlang on me), I see a large number of sender_death events (not sure if these are common or highly unusual?)<br>
&gt;<br>
&gt; {&#39;$gen_cast&#39;,{gm,{sender_death,&lt;2710.20649.64&gt;}}}<br>
&gt;<br>
<br>
</div>Interesting - will take a look at that. If you could provide logs for the participating nodes during this whole time period, that would help a lot.<br>
<div><br>
&gt; mixed in with other more regular events:<br>
&gt;<br>
<br>
</div>Actually, sender_death messages are not &quot;irregular&quot; as such. They&#39;re just notifying the GM group members that another member (on another node) has died. This is quite normal with mirrored queues, when nodes get partitioned or stopped due to cluster recovery modes.<br>


<br>
Cheers,<br>
Tim<br>
<br>
<br>
_______________________________________________<br>
rabbitmq-discuss mailing list<br>
<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com" target="_blank">rabbitmq-discuss@lists.rabbitmq.com</a><br>
<a href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><br>
</blockquote></div></div><br></div></div>
</blockquote></div><br></div></div>