[rabbitmq-discuss] RabbitMQ blocking issue

Konar Peter (Nokia-LC/Chicago) peter.konar at nokia.com
Wed Feb 20 21:54:42 GMT 2013


++Justin and Craig

-----Original Message-----
From: Konar Peter (Nokia-LC/Chicago)
Sent: Wednesday, February 20, 2013 2:02 PM
To: Konar Peter (Nokia-LC/Chicago); ext Matthias Radestock; Carroll James (Nokia-LC/Malvern); dbuchko at vmware.com
Cc: Discussions about RabbitMQ; Yu Jerry.3 (Nokia-LC/Alpharetta); Lieber Dick (Nokia-LC/Chicago); Johnson Anthony (Nokia-LC/Alpharetta); Mccuiston Mitchell (Nokia-LC/Chicago)
Subject: RE: [rabbitmq-discuss] RabbitMQ blocking issue

Adding Rabbit logs
pk
________________________________________
From: Konar Peter (Nokia-LC/Chicago)
Sent: Wednesday, February 20, 2013 2:00 PM
To: ext Matthias Radestock; Carroll James (Nokia-LC/Malvern)
Cc: Discussions about RabbitMQ; Yu Jerry.3 (Nokia-LC/Alpharetta); Lieber Dick (Nokia-LC/Chicago); Johnson Anthony (Nokia-LC/Alpharetta); Mccuiston Mitchell (Nokia-LC/Chicago)
Subject: RE: [rabbitmq-discuss] RabbitMQ blocking issue

Matthias,
   We have automated the recovery of logs for this issue to speed analysis. I have attached two logs that show issues with in the first 5 minutes of it happening.
In "report.pt.13:35" we show that 10.196.42.180 in not outputting as scheduled. These logs include jstacks of the processes on the 5 machines outputting as well as rabbit reports and netstats.


report.pt.13:40‎ (516 KB‎) shows that there are two machines

Thanks
Peter Konar



______________________________________
From: ext Matthias Radestock [matthias at rabbitmq.com]
Sent: Thursday, February 14, 2013 2:21 AM
To: Carroll James (Nokia-LC/Malvern)
Cc: Konar Peter (Nokia-LC/Chicago); Discussions about RabbitMQ; Yu Jerry.3 (Nokia-LC/Alpharetta); Lieber Dick (Nokia-LC/Chicago); Johnson Anthony (Nokia-LC/Alpharetta); Mccuiston Mitchell (Nokia-LC/Chicago)
Subject: Re: [rabbitmq-discuss] RabbitMQ blocking issue

James,

On 13/02/13 23:40, Carroll James (Nokia-LC/Malvern) wrote:
> Yup. The times are about right. See the connections from 182. The
> first email show port 49191 which was opened at 15:17 GMT. It was hung
> for quite a while before we got to it but the server closed it at
> 21:22 and opened a new one. That's around the time we were in there.
>
> If it matters the log is attached.

Here's the relevant portion:

<quote>
=INFO REPORT==== 13-Feb-2013::15:17:00 === accepting AMQP connection <0.2252.1282> (10.196.42.182:49191 ->
10.196.42.21:5672)

=ERROR REPORT==== 13-Feb-2013::21:22:01 === closing AMQP connection <0.2252.1282> (10.196.42.182:49191 ->
10.196.42.21:5672):
{heartbeat_timeout,running}
</quote>

So the connection was open for quite a while and then got closed due to a missed heartbeat.

The default heartbeat interval is 10 minutes, but the server will only time out between two and three intervals after the last data has been received, so it could take up to half an hour for the server to kill a connection that has gone bad.

The question is why the server isn't seeing any data. There are two obvious explanations:
a) there is a network disruption, and
b) the server has stopped reading from the socket

The presence of a non-empty Recv-Q on the server-side connection points to the latter. We need to get hold of a 'rabbitmqctl report' showing that connection while the Recv-Q is non-empty, so we know what state the server thought the connection was in at the time.

Regards,

Matthias.

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above.  If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited.  If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.


More information about the rabbitmq-discuss mailing list