<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On 26 September 2013 14:10, Michael Klishin <span dir="ltr">&lt;<a href="mailto:michael@rabbitmq.com" target="_blank">michael@rabbitmq.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>

On sep 26, 2013, at 1:50 p.m., josh &lt;<a href="mailto:martin.rogan.inc@gmail.com">martin.rogan.inc@gmail.com</a>&gt; wrote:<br>

<br>

&gt; Let me restate that... With 30K+30K channels the first 100 each take 10 seconds to close using 100 simultaneous threads. The remaining 59,900 each take less than 0.5 seconds. My feeling is that there&#39;s some funky connection-wide synchronization/continuation going on here. Hit the connection up with 100 channel-close requests on 100 threads simultaneously and it baulks. Whatever causes that initial spasm doesn&#39;t seem to affect subsequent close operations and everything swims along nicely.<br>


<br>

This is correct. Closing either a channel or connection involves waiting for a reply from RabbitMQ.<br>

It would be interesting to see thread dumps and as much information about lock contention as you can provide. My guess is that it is _channelMap but I&#39;m not a very reliable prediction machine.<br>

<br></blockquote><div> </div><div>In tag 3.1.5 I can point to the close(...) method in ChannelN.java at line 569:</div><div><br></div><div>            // Now that we&#39;re in quiescing state, channel.close was sent and</div>

<div>            // we wait for the reply. We ignore the result.</div><div>            // (It&#39;s NOT always close-ok.)</div><div>            notify = true;</div><div>            k.getReply(-1);</div><div><br></div><div>

Here k.getReply(-1) does the waiting. In my dodgy mod I skipped these two lines and also the finally block (notify==false):</div><div><br></div><div><div>        } finally {</div><div>            if (abort || notify) {</div>

<div>                // Now we know everything&#39;s been cleaned up and there should</div><div>                // be no more surprises arriving on the wire. Release the</div><div>                // channel number, and dissociate this ChannelN instance from</div>

<div>                // our connection so that any further frames inbound on this</div><div>                // channel can be caught as the errors they are.</div><div>                releaseChannel();</div><div>                notifyListeners();</div>

<div>            }</div><div>        }</div></div><div><br></div><div>Hence the channel resource leak and subsequent OOM. Although the delay disappeared and the channels were closed on the server it doesn&#39;t reveal where the delay was incurred. The client may have just been waiting for replies to come in after other data on the connection, with no lock contention, but on the other hand how do the subsequent closures get processed so much quicker?</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

&gt;<br>

&gt; I&#39;ve tried ramping up the number of connections to relieve the pressure. This certainly works with predictable results. With 30K+30K channels spread evenly over 2 connections the initial 100 channel-close delays are halved from 10 seconds to 5 seconds. Use 10 connections and the delay is imperceptible when compared to the subsequent 59,900 channel closures. Jump to 50K+50K channels (we can do this with 10 connections but not 1 connection due to channel-max) and the delays start to creep back in again.<br>


<br>

Again, hard to tell what the contention point is without runtime data.<br>

<br>

&gt;<br>

&gt; My concerns with this approach are that 1) multiple connections are discouraged in the documentation due to I/O resource overhead and that 2) it&#39;s not clear for my application how to sensibly predict the optimum number of channels per connection. If there is a soft limit to the number of channels per connection, why is it not documented or made available in the API?<br>


<br>

See ConnectionFactory.DEFAULT_CHANNEL_MAX and ConnectionFactory#setRequestedChannelMax.<br>

<br>

Note that some clients have a different default (like 65536 channels).<br></blockquote><div><br></div><div><br></div><div>In my 3.1.5 client ConnectionFactory.DEFAULT_CHANNEL_MAX==0 and connection.getChannelMax()==65,536.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

&gt;<br>

&gt; I&#39;ve tried my hand at modifying the client library by not waiting for channel-close acknowledgements from the RabbitMQ server. This worked like a charm. Channels were closed instantly with no delay in the client and confirmed as closed on the server. Eight hours later though and I was out of heap space as the channel resources internal to the client library were not being released. I haven&#39;t managed to isolate the source of the delay either... is it in the client library or the server itself?<br>


<br>

You need to make sure that ChannelManager#disconnectChannel is used. VisualVM should<br>

pretty quickly show what objects use most heap space.</blockquote><div><br></div><div><br></div><div>As revealed by YourKit, mountains of Channels were not cleaned up by my dodgy mod skipping disconnectChannel(). But I figured it was unsafe to invoke, as we don&#39;t know that &quot;everything&#39;s been cleaned up and there should be no more surprises arriving on the wire.&quot;</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


&gt;<br>

&gt; Questions:<br>

&gt;<br>

&gt; Before making application changes I&#39;d like to know if this is a known issue with the Java client?<br>

<br>

I&#39;ve seen this before with 2 other clients. In one case the problem was different and mostly solved<br>

(I have not tried 60K channels but for 6-8K it worked reasonably well). Another client is built on the<br>

Java one. So, it&#39;s a known problem that few people run into.<br>

<br></blockquote><div><br></div><div><br></div><div>A few seconds here and there is not so problematic really. RabbitMQ is so critical to our application though that we need to ensure we&#39;re not falling off any edges.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

&gt; Are there better workarounds than multiple connections and application-level channel management? In practice my actual application uses around 20K channels per process, which I don&#39;t feel is excessive, and message throughput is actually pretty light as I&#39;m leveraging RabbitMQ more for its routing capabilities. If you think the number of channels is a problem in itself then please say so! I could refactor to use fewer channels but then I&#39;d be sharing channels and would either have to synchronize their usage or ignore documentation guidelines.<br>


<br>

This is something that should be improved in the Java client, but in the meantime you may need<br>

to use a pool of connections that will open channels using round robin or similar.<br>
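<br>
<br>
A minimal sketch of the round-robin part of such a pool, assuming the connections are already open and held in an array or list. RoundRobinSelector and nextIndex() are hypothetical names and not part of the Java client; the only client call assumed is Connection#createChannel, and it appears only in a comment.<br>
<pre>
```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper: hands out slot indexes 0..size-1 in rotation.
 * Each slot is assumed to correspond to one already-open connection.
 */
final class RoundRobinSelector {
    private final int size;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSelector(int size) {
        // Reject zero or negative pool sizes.
        if (Integer.signum(size) != 1) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    /** Returns the next slot index; safe to call from many threads. */
    int nextIndex() {
        // floorMod keeps the result non-negative after the counter overflows.
        return Math.floorMod(counter.getAndIncrement(), size);
    }
}

// Assumed usage with the RabbitMQ client (not compiled here):
//   Connection conn = connections[selector.nextIndex()];
//   Channel channel = conn.createChannel();
```
</pre>
Spreading new channels evenly this way only balances how many channels each connection carries; it does not by itself show where the close delay comes from.<br>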

<br></blockquote><div><br></div><div>Done.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


&gt; The error handling paradigm makes this cumbersome though; any channel error results in its termination, so it&#39;s difficult to isolate errors, prevent them from permeating across unrelated publishers/consumers and recover in a robust manner.<br>


<br>

This is in part why having one channel per thread is a very good idea.<br>

<br>

To summarize: yes, this is a known but rare problem. If you can provide profiling and thread dump<br>

information that will help isolate the contention point, I think the issue can be resolved or largely<br>

mitigated in a future version.<br>

<span class=""><font color="#888888"><br></font></span></blockquote><div><br></div><div>Thanks. Will do. Would you prefer a plain old Java app that you can profile yourself?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><font color="#888888">

MK<br>

<br>

<br>

<br>

</font></span></blockquote></div><br></div></div>