[rabbitmq-discuss] Long pauses when closing many channels simultaneously

Martin Rogan martin.rogan.inc at gmail.com
Thu Sep 26 16:26:36 BST 2013


On 26 September 2013 14:10, Michael Klishin <michael at rabbitmq.com> wrote:

>
> > On Sep 26, 2013, at 1:50 p.m., josh <martin.rogan.inc at gmail.com> wrote:
>
> > Let me restate that... With 30K+30K channels the first 100 each take 10
> seconds to close using 100 simultaneous threads. The remaining 59,900 each
> take less than 0.5 seconds. My feeling is that there's some funky
> connection-wide synchronization/continuation going on here. Hit the
> connection up with 100 channel-close requests on 100 threads simultaneously
> and it baulks. Whatever causes that initial spasm doesn't seem to affect
> subsequent close operations and everything swims along nicely.
>
> This is correct. Closing either a channel or connection involves waiting
> for a reply from RabbitMQ.
> It would be interesting to see thread dumps and as much information about
> lock contention as you can provide. My guess is that it is _channelMap, but
> I'm not a very reliable prediction machine.
>
>
In tag 3.1.5 I can point to the close(...) method in ChannelN.java at line
569:

            // Now that we're in quiescing state, channel.close was sent and
            // we wait for the reply. We ignore the result.
            // (It's NOT always close-ok.)
            notify = true;
            k.getReply(-1);

Here k.getReply(-1) does the waiting. In my dodgy mod I skipped these two
lines, which in turn skips the body of the finally block (notify == false):

        } finally {
            if (abort || notify) {
                // Now we know everything's been cleaned up and there should
                // be no more surprises arriving on the wire. Release the
                // channel number, and dissociate this ChannelN instance from
                // our connection so that any further frames inbound on this
                // channel can be caught as the errors they are.
                releaseChannel();
                notifyListeners();
            }
        }

Hence the channel resource leak and subsequent OOM. Although the delay
disappeared and the channels were confirmed closed on the server, this
doesn't reveal where the delay was incurred. The client may simply have been
waiting for replies queued behind other data on the connection, with no lock
contention at all; but then how do the subsequent closures get processed so
much more quickly?
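
To provide the lock-contention data, I can snapshot all threads from a
watchdog while the first batch of close() calls is stalled. A minimal sketch
(nothing RabbitMQ-specific, just java.lang.management; the class name is
mine):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class StackSnapshot {
        // Call this from a watchdog thread a few seconds into the close
        // storm. ThreadInfo#toString shows the thread state plus the
        // monitors/synchronizers each thread holds or is blocked on.
        public static void dump() {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.println(info);
            }
        }
    }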


>
> > I've tried ramping up the number of connections to relieve the pressure.
> This certainly works with predictable results. With 30K+30K channels
> spread evenly over 2 connections the initial 100 channel-close delays are
> halved from 10 seconds to 5 seconds. Use 10 connections and the delay is
> imperceptible when compared to the subsequent 59,900 channel closures. Jump
> to 50K+50K channels (we can do this with 10 connections but not 1
> connection due to channel-max) and the delays start to creep back in again.
>
> Again, hard to tell what the contention point is without runtime data.
>
> >
> > My concerns with this approach are that 1) multiple connections are
> discouraged in the documentation due to i/o resource overhead and that 2)
> it's not clear for my application how to sensibly predict the optimum
> number of channels per connection. If there is a soft limit to the number
> of channels per connection why is it not documented or made available in
> the api?
>
> See ConnectionFactory.DEFAULT_CHANNEL_MAX and
> ConnectionFactory#setRequestedChannelMax.
>
> Note that some clients have a different default (like 65536 channels).
>


In my 3.1.5 client ConnectionFactory.DEFAULT_CHANNEL_MAX==0 and
connection.getChannelMax()==65,536.
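
For what it's worth, this is the kind of check I used to read the negotiated
value (trivial sketch; the host and the rest of the connection setup are
placeholders):

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ChannelMaxCheck {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");      // placeholder broker address
            factory.setRequestedChannelMax(0); // 0 = no client-side limit
            Connection connection = factory.newConnection();
            // the channel-max actually negotiated with the broker
            System.out.println("channel-max: " + connection.getChannelMax());
            connection.close();
        }
    }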


>
> > I've tried my hand at modifying the client library by not waiting for
> channel-close acknowledgements from the RabbitMQ server. This worked like a
> charm. Channels were closed instantly with no delay in the client and
> confirmed as closed on the server. Eight hours later though and I was out
> of heap space as the channel resources internal to the client library were
> not being released. I haven't managed to isolate the source of the delay
> either... is it in the client library or the server itself?
>
> You need to make sure that ChannelManager#disconnectChannel is used.
> VisualVM should
> pretty quickly show what objects use most heap space.



As revealed by YourKit, mountains of Channels were not cleaned up because my
dodgy mod skips disconnectChannel(). But I figured it was unsafe to invoke,
since at that point we don't know that "everything's been cleaned up and
there should be no more surprises arriving on the wire."



>
>
> > Questions:
> >
> > Before making application changes I'd like to know if this is a known
> issue with the Java client?
>
> I've seen this before with 2 other clients. In one case the problem was
> different and mostly solved
> (I have not tried 60K channels but for 6-8K it worked reasonably well).
> Another client is built on the
> Java one. So, it's a known problem that few people run into.
>
>

A few seconds here and there is not so problematic really. RabbitMQ is so
critical to our application though that we need to ensure we're not falling
off any edges.


> Are there better workarounds than multiple connections and
> application-level channel management? In practice my actual application
> uses around 20K channels per process, which I don't feel is excessive, and
> message throughput is actually pretty light as I'm leveraging RabbitMQ more
> for its routing capabilities. If you think the number of channels is a
> problem in itself then please say so! I could refactor to use fewer channels
> but then I'd be sharing channels and would either have to synchronize their
> usage or ignore documentation guidelines.
>
> This is something that should be improved in the Java client, but in the
> meantime you may need
> to use a pool of connections that will open channels using round robin or
> similar.
>
>
Done.
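
For the curious, the shape of it is roughly this (a bare-bones sketch of the
idea; the class is mine and leaves out error handling and recovery):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;

    // Channels are spread round robin over a fixed set of connections so no
    // single connection has to service every channel.close reply.
    public class ConnectionPool {
        private final Connection[] connections;
        private final AtomicLong counter = new AtomicLong();

        public ConnectionPool(ConnectionFactory factory, int size) throws IOException {
            connections = new Connection[size];
            for (int i = 0; i < size; i++) {
                connections[i] = factory.newConnection();
            }
        }

        public Channel createChannel() throws IOException {
            int index = (int) (counter.getAndIncrement() % connections.length);
            return connections[index].createChannel();
        }

        public void close() throws IOException {
            for (Connection c : connections) {
                c.close();
            }
        }
    }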


> > The error handling paradigm makes this cumbersome though; any channel
> error results in its termination, so it's difficult to isolate errors,
> prevent them from permeating across unrelated publishers/consumers and
> recover in a robust manner.
>
> This is in part why having one channel per thread is a very good idea.
>
> To summarize: yes, this is a known but rare problem. If you can provide
> profiling and thread dump
> information that will help isolating the contention point, I think the
> issue can be resolved or largely
> mitigated in a future version.
>
>
Thanks. Will do. Would you prefer a plain old Java app that you can profile
yourself?
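
If so, the skeleton would be something like this (a sketch only; counts, the
host and the thread-pool size need adjusting to reproduce the 30K+30K case):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ChannelCloseStorm {
        public static void main(String[] args) throws Exception {
            final int channelCount = 60000; // e.g. 30K publish + 30K consume
            final int closerThreads = 100;  // simultaneous close() callers

            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");   // placeholder broker address
            Connection connection = factory.newConnection();

            List<Channel> channels = new ArrayList<Channel>(channelCount);
            for (int i = 0; i < channelCount; i++) {
                channels.add(connection.createChannel());
            }

            ExecutorService pool = Executors.newFixedThreadPool(closerThreads);
            for (final Channel ch : channels) {
                pool.execute(new Runnable() {
                    public void run() {
                        long start = System.nanoTime();
                        try {
                            ch.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                        System.out.println("close took " + ms + " ms");
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            connection.close();
        }
    }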

> MK