[rabbitmq-discuss] Towards better handling of RabbitMQ connection/channel failures

Jonathan Halterman jhalterman at gmail.com
Thu Sep 19 01:37:13 BST 2013


Thanks for the response, Tim.

On Wed, Sep 18, 2013 at 2:59 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hi Jonathan,
> On 17 Sep 2013, at 19:34, Jonathan Halterman wrote:
>
> Why are the shutdown listeners for only some of my channels/connections
> called when a rabbit server shuts down?
>
>
> All connection/channel shutdown listeners are triggered when the client
> detects the shutdown. When the shutdown originates at the server, this
> activity is mediated on the client side in one of two ways: either (a) the
> client receives a `connection.close' AMQP method from the broker, or (b)
> the OS networking layer signals to the JVM that the socket has closed, at
> which point the listening/reading thread handles the relevant exception,
> tears down any associated local resources and fires the shutdown listeners.
> In the latter case, there can be a significant time delay before the
> operating system "notices" that the peer socket has closed/disappeared.
> Having said all that....
>
> On Mon, Sep 16, 2013 at 5:13 PM, Jonathan Halterman <jhalterman at gmail.com> wrote:
>
>> I've been experimenting with various sorts of RabbitMQ failures that
>> result in connections and channels being shut down with the goal of being
>> able to re-establish connections, channels, and consumers whenever a
>> failure occurs. In particular, I've been forcing network partitions on a
>> pause_minority configured cluster with a client connected to what will
>> become the minority node, to see how things behave, and the results are a
>> bit inconsistent.
>>
>>
> How exactly are you forcing network partitions? Are you causing packet
> loss (using pf or iptables) or doing something else?
>

iptables


>
>> For a simple test, I created 2 connections and 6 channels, then partitioned
>> the cluster.
>>
>
> How did you partition the cluster?
>

Tweaking iptables to drop traffic to/from other nodes in the cluster.


>
>> Within a minute or so the minority node (to which my client is
>> connected) shuts itself down.
>>
>
> If a RabbitMQ node decides to undertake an orderly shutdown, then all AMQP
> connections should be explicitly closed (as per method "a" listed above)
> before the network connection is severed. This might not work as expected
> if the network connection between client and server is unavailable and/or
> subject to packet loss. If the `connection.close' signal the broker sends
> doesn't make it to the client, then the shutdown listeners won't fire until
> the client's (OS) network stack detects the problem, which can take up to
> 30 mins depending on environment configuration.
>
>> What happens next varies a bit with each test run:
>>
>> Outcome 1: Immediately the shutdown listeners for my 2 connections and
>> all 6 channels are called.
>>
>>
> That is what I'd expect to happen if:
>
> 1. both connections are between the client and the broker that is shutting
> down
> 2. the network link between the broker that is shutting down and the
> client is in good condition (no packet loss, etc) such that the
> connection.close from the broker arrives at the client as expected
>
>> Outcome 2: Immediately 2 of my 6 channels' shutdown listeners are called.
>> None of the connection shutdown listeners are called. After waiting a few
>> minutes I heal the partition and the shutdown listeners for the 2
>> connections and the remaining 4 channels are immediately called.
>>
>>
> That doesn't sound right. If both connections are between the client and
> the server that is shutting down, and there is a bug in the shutdown
> listener handling code, then this problem would be showing up all the time
> (and we'd have fixed it). It is also unnecessary to consider
> clustering/partitions if the behaviour you describe is happening for two
> connections between one client and one broker.
>
> Can you share a minimal example of the code you're using, please?
>
>> Outcome 3: Immediately 2 of my 6 channels' shutdown listeners are called.
>> None of the connection shutdown listeners are called. After about 30
>> seconds, with the cluster still partitioned, the shutdown listeners for the
>> 2 connections and the remaining 4 channels are immediately called.
>>
>>
> There are no timing guarantees about when shutdown listeners will fire. As
> I mentioned, these events are only triggered when either the client "sees"
> a `connection.close' from the broker or detects a network failure whilst
> listening/sending. Since both of these factors are entirely dependent on
> the network between client and server, and on the networking layers of the
> various participating operating systems, the `connection.close' and/or
> socket closed exception will be detected when the client's OS delivers the
> relevant signal to the JVM and up into the client library's application
> code, at which point it is handled immediately.
>
> In both of these cases, if some channels are being used to `send' data,
> and the disconnection between client and server involves loss of network
> connectivity, then the "sending" channels are most likely to "see"
> IOExceptions before the "listening" channels. Modern OS networking stacks
> are often configured with lower retry thresholds for sending than they are
> for receiving, thus detection of network failures will likely vary
> considerably depending on what you're doing in a particular channel over a
> particular connection.
>

I think you've basically hit on what I'm experiencing. The client in
question is serving as a consumer only.
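
To illustrate why a consumer-only client can be slow to notice a failure,
here is a minimal sketch using plain java.net sockets on loopback rather than
amqp-client itself. The class and method names (IdleReadDemo, observe) are
hypothetical. It assumes that an orderly close by the peer surfaces to a
blocked reader as end-of-stream, while a peer that simply goes silent (as when
packets are dropped) is only noticed if the reader sets a read timeout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of amqp-client.
class IdleReadDemo {

    /**
     * Opens a loopback connection and performs a single blocking read on the
     * client side. Returns "eof" if the peer closed the connection, "timeout"
     * if nothing arrived within timeoutMillis, or "data" if a byte was read.
     */
    static String observe(boolean peerCloses, int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            if (peerCloses) {
                // Orderly close: the reader is told about it by the OS.
                peer.close();
            }
            // Without a read timeout, a reader facing a silent peer would
            // block until the OS itself gives up on the connection.
            client.setSoTimeout(timeoutMillis);
            try {
                int b = client.getInputStream().read();
                return b == -1 ? "eof" : "data";
            } catch (SocketTimeoutException e) {
                return "timeout";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("peer closes:      " + observe(true, 500));
        System.out.println("peer stays quiet: " + observe(false, 500));
    }
}
```

In the Java client the comparable safeguard is AMQP heartbeats (requested via
ConnectionFactory.setRequestedHeartbeat), which bound how long a silent
connection can go unnoticed by a client that only reads.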


>
>> I'm interested to learn more about when and why certain shutdown listeners
>> might or might not be invoked so I can do a better job of re-establishing
>> resources after a failure. Any input is appreciated.
>>
>>
> If you can share an example of your client code, boiled down to the
> minimal details, that would help. Please also confirm exactly what your
> setup looks like, viz
>

I wrote a test attempting to reproduce what my actual client is
experiencing, but I was only able to come close to reproducing my client's
results when pushing a lot of volume down to the consumers, and even then
it was not consistent enough to draw any conclusions. At this point I'm
satisfied to simply tweak my client to account for potential delays in
ShutdownListener calls and move on. I just wanted to be sure that there
were no mechanisms introduced by amqp-client which could be causing any
additional ShutdownListener delays, and it sounds like there are not.
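
One way to account for such delays, sketched below under stated assumptions,
is to track when the consumer last saw any activity and treat prolonged
silence as a hint that recovery may be needed, instead of relying only on the
ShutdownListener callback. StalenessWatchdog and its methods are hypothetical
names, the 30 second threshold is arbitrary, and the clock is injected purely
so the logic can be exercised without sleeping.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper; not part of amqp-client.
class StalenessWatchdog {
    private final long maxSilenceMillis;
    private final LongSupplier clock;      // supplies the current time in millis
    private final AtomicLong lastActivity; // time of the last observed activity

    StalenessWatchdog(long maxSilenceMillis, LongSupplier clock) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.clock = clock;
        this.lastActivity = new AtomicLong(clock.getAsLong());
    }

    /** Call from the consumer whenever a delivery (or other traffic) arrives. */
    void recordActivity() {
        lastActivity.set(clock.getAsLong());
    }

    /** True once more than maxSilenceMillis has passed since the last activity. */
    boolean isStale() {
        return clock.getAsLong() - lastActivity.get() > maxSilenceMillis;
    }

    public static void main(String[] args) {
        StalenessWatchdog watchdog = new StalenessWatchdog(30_000, System::currentTimeMillis);
        watchdog.recordActivity();
        System.out.println("stale right after activity: " + watchdog.isStale());
    }
}
```

A queue that is legitimately idle looks the same as a dead connection to this
check, so it is best treated as a trigger for a cheap liveness probe rather
than as proof of failure.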

Cheers,
Jonathan


>
> 1. are both connections made between the client and exactly one server
>
> 2. how are you "partitioning" the server from the rest of the cluster
>
> 3. are you sending or receiving on the various channels that we're talking
> about
>
> Tim
>
>