[rabbitmq-discuss] RabbitMQ server closing connection every few minutes

Tim Watson tim at rabbitmq.com
Tue May 22 09:25:51 BST 2012


On 22/05/2012 09:00, Tim Watson wrote:
> On 22/05/2012 08:48, Tim Watson wrote:
>> On 22/05/2012 07:34, David Tinker wrote:
>>> Hi
>>>
>>> We have a Java application using RabbitMQ that keeps getting its
>>> connection closed by the server every few minutes. On the client side
>>> we see this:
>>>
>>> 2012-05-22 05:30:05,337 [AMQP Connection 127.0.0.1:5671] ERROR
>>> pork.RabbitService - Connection to Rabbit Exchange shutting down:
>>> com.rabbitmq.client.ShutdownSignalException: connection error; reason:
>>> java.net.SocketException: Connection reset
>>> Reason: java.net.SocketException: Connection reset
>>>
>>> On the server side we get this:
>>>
>>> =INFO REPORT==== 22-May-2012::05:25:20 ===
>>> accepting AMQP connection<0.32660.612> (127.0.0.1:45458 ->
>>> 127.0.0.1:5671)
>>>
>>> =ERROR REPORT==== 22-May-2012::05:30:05 ===
>>> closing AMQP connection<0.32660.612> (127.0.0.1:45458 ->
>>> 127.0.0.1:5671):
>>> {timeout,running}
>>>
>>
>> Hi David. The second error message will appear in the logs when there
>> has been an un-handled exception in rabbit's 'reader process'. The
>> '{timeout, running}' state occurs when receiving data from the network
>> takes too long (which is where the 'timeout' tag originates) but the
>> AMQP connection state is 'running' - the second tag. When this condition
>> arises, the broker's reader process shuts down the connection.
>>
>> So at first glance, it does look like a networking problem to me,
>> although TBH I'm not sure about the nature of it just yet. I suspect it
>> could have something to do with "... It can sometimes be several minutes
>> before a message is acked or nacked ...".
>>
>>> It happens more frequently when the system is busy (500-1000
>>> messages/s) than when it is cruising (50 messages/s). The application
>>> and RabbitMQ are on the same box so I don't think its a
>>> networking/firewall problem.
>>>
>>> The application only uses one connection and one channel. Messages are
>>> received using Connection.basicConsume(..) and acked or nacked
>>> sometime later on different threads. It can sometimes be several
>>> minutes before a message is acked or nacked. We are using
>>> Connection.basicQos(10000) to limit the number of outstanding messages
>>> but its not often that reaches 10000 and the error happens anytime.
>>> There is defensive code to make sure messages are not acked/nacked
>>> more than once.
>>>
>>> Anyone have any ideas? Is there any way to get more information on the
>>> server or client side regarding the error?
>>>
>>
>> Well the server error seems pretty clear and just to be 100% sure, I can
>> verify that the error message you're seeing is only generated in one
>> place in the whole code base:
>> https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_reader.erl#L241.
>>
>>
>>
>
> Except that this isn't quite as clear as I suggested initially. The
> networking layer isn't going to send a timeout on a blocking receive
> like this - the 'timeout' message is being sent to the reader from some
> other process. I will investigate further! :)
>

Ok so I think I've identified the source of this behaviour. The 
'timeout' message is sent to the reader process by the AMQP heartbeat - 
a separate out of band process that is monitoring for 'stale' 
connections using heartbeat frames (see 
http://www.rabbitmq.com/amqp-0-9-1-reference.html for details).

What I believe is happening is that as the server comes under load, the 
heartbeat for this connection is de-prioritised at the network layer and 
times out, causing the connection to be closed.

Set your heartbeat to 0 (turning it off) for now and see if it fixes the 
problem? That's a short term investigative step though, obviously.

I'll file a bug for improving the log messages when timeout occurs.

>> The code handling the timeout condition is
>> https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_reader.erl#L321,
>>
>> the call site for which is
>> https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_reader.erl#L284.
>>
>> The origin of this is
>> https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_net.erl#L108
>>
>> in rabbit_net - what the code does is make the calling process wait for
>> a message from the Erlang networking layer. So it is the underlying
>> TCP/IP layer in Erlang which is noticing a timeout on the connection.
>>
>> How this is mediated through the client's behaviour is another matter.
>>
>> Hope this helps us get started figuring out what's wrong!
>>
>> Cheers,
>> Tim
>>
>>> RabbitMQ 2.8.1 on Gentoo Linux 64bit
>>> Erlang R15B (erts-5.9) [source] [64-bit]
>>> amqp-client-2.8.2 (also tried amqp-client-2.7.1)
>>> java version "1.6.0_31"
>>> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
>>> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
>>>
>>> Thanks
>>> David
>>> _______________________________________________
>>> rabbitmq-discuss mailing list
>>> rabbitmq-discuss at lists.rabbitmq.com
>>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss



More information about the rabbitmq-discuss mailing list