[rabbitmq-discuss] RabbitMQ blocking issue

Konar Peter (Nokia-LC/Chicago) peter.konar at nokia.com
Wed Feb 20 21:54:17 GMT 2013


++Justin and Craig

-----Original Message-----
From: Konar Peter (Nokia-LC/Chicago)
Sent: Wednesday, February 20, 2013 2:00 PM
To: ext Matthias Radestock; Carroll James (Nokia-LC/Malvern)
Cc: Discussions about RabbitMQ; Yu Jerry.3 (Nokia-LC/Alpharetta); Lieber Dick (Nokia-LC/Chicago); Johnson Anthony (Nokia-LC/Alpharetta); Mccuiston Mitchell (Nokia-LC/Chicago)
Subject: RE: [rabbitmq-discuss] RabbitMQ blocking issue

Matthias,
   We have automated the recovery of logs for this issue to speed analysis. I have attached two logs that show issues with in the first 5 minutes of it happening.
In "report.pt.13:35" we show that 10.196.42.180 in not outputting as scheduled. These logs include jstacks of the processes on the 5 machines outputting as well as rabbit reports and netstats.


report.pt.13:40‎ (516 KB‎) shows that there are two machines

Thanks
Peter Konar



______________________________________
From: ext Matthias Radestock [matthias at rabbitmq.com]
Sent: Thursday, February 14, 2013 2:21 AM
To: Carroll James (Nokia-LC/Malvern)
Cc: Konar Peter (Nokia-LC/Chicago); Discussions about RabbitMQ; Yu Jerry.3 (Nokia-LC/Alpharetta); Lieber Dick (Nokia-LC/Chicago); Johnson Anthony (Nokia-LC/Alpharetta); Mccuiston Mitchell (Nokia-LC/Chicago)
Subject: Re: [rabbitmq-discuss] RabbitMQ blocking issue

James,

On 13/02/13 23:40, Carroll James (Nokia-LC/Malvern) wrote:
> Yup. The times are about right. See the connections from 182. The
> first email show port 49191 which was opened at 15:17 GMT. It was
> hung for quite a while before we got to it but the server closed it
> at 21:22 and opened a new one. That's around the time we were in
> there.
>
> If it matters the log is attached.

Here's the relevant portion:

<quote>
=INFO REPORT==== 13-Feb-2013::15:17:00 ===
accepting AMQP connection <0.2252.1282> (10.196.42.182:49191 ->
10.196.42.21:5672)

=ERROR REPORT==== 13-Feb-2013::21:22:01 ===
closing AMQP connection <0.2252.1282> (10.196.42.182:49191 ->
10.196.42.21:5672):
{heartbeat_timeout,running}
</quote>

So the connection was open for quite a while and then got closed due to
a missed heartbeat.

The default heartbeat interval is 10 minutes, but the server will only
time out between two and three intervals after the last data has been
received, so it could take up to half an hour for the server to kill a
connection that has gone bad.

The question is why the server isn't seeing any data. There are two
obvious explanations:
a) there is a network disruption, and
b) the server has stopped reading from the socket

The presence of a non-empty Recv-Q on the server-side connection points
to the latter. We need to get hold of a 'rabbitmqctl report' showing
that connection while the Recv-Q is non-empty, so we know what state the
server thought the connection was in at the time.

Regards,

Matthias.

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above.  If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited.  If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.


More information about the rabbitmq-discuss mailing list