[rabbitmq-discuss] RabbitMQ 2.0 hanging

Wed Sep 8 16:36:30 BST 2010

We installed RMQ 2.0 yesterday in our QA environment and noticed that it had 
hung this morning.

Env Setup:
- RabbitMQ 2.0 on Erlang R13B04 on a CentOS VMware VM
- 2 nodes clustered together with both nodes as disk nodes
- Load balancer in front of both nodes, distributing connections between them 
round-robin
- Status plugin
- Message producers/consumers: Tomcat webapps using Spring-AMQP 1.0 M1 and 
RabbitMQ client 1.8.1
- Very low message volume as this is a dev/QA environment, practically none 
overnight

Problem:
8-Sep-2010::09:30~    - We couldn't start our Tomcat webapps on our local dev 
machines this morning because they hung when attempting to connect to RabbitMQ
8-Sep-2010::09:40~    - Could not load Status Plugin webpage 
8-Sep-2010::09:40~    - rabbitmqctl status on node 1 indicated everything was ok
8-Sep-2010::09:40~    - rabbitmqctl list_queues hung on node 1
8-Sep-2010::09:45:38 - rabbitmqctl stop_app and start_app on node 1 didn't solve 
the problem
8-Sep-2010::09:53:03 - rabbitmqctl stop and rabbitmq-server -detached on node 1 
fixed the problem
No commands were run on node 2 - because the person troubleshooting didn't have 
access to that machine :) 
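
The difference seen above (rabbitmqctl status returning while rabbitmqctl 
list_queues hangs) can be captured automatically by running each command under 
a timeout. A minimal sketch in Python, assuming rabbitmqctl is on the PATH of 
the node being checked; the names run_with_timeout and check_broker and the 
30-second limit are made up for illustration:

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed' or 'hung'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising this.
        return "hung"
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def check_broker(timeout_s=30):
    """Hypothetical helper: run the two commands from the timeline above."""
    commands = (["rabbitmqctl", "status"], ["rabbitmqctl", "list_queues"])
    return {" ".join(c): run_with_timeout(c, timeout_s) for c in commands}
```

Calling check_broker() from cron on each node and alerting on any "hung" 
result would at least record when the hang starts, rather than when someone 
next tries to connect.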

History:
Something similar happened before on RabbitMQ 1.8.1 as well. It happens roughly 
once every 2 weeks in our QA environment (sometimes several times in one day, 
after which it runs fine for 2 weeks) and has never happened in production. We 
have both the Status and BQL plugins installed on RMQ 1.8.1 in production, but 
only the Status plugin on the RMQ 2.0 that we're testing in QA. We could try 
disabling plugins, but I don't think that's the right way to troubleshoot this: 
because the problem happens so rarely, a quiet spell after disabling a plugin 
might lead us to believe the plugin was at fault when it actually was not.

Logs:
Attached logs from both nodes. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=INFO REPORT==== 8-Sep-2010::09:30:52 ===
accepted TCP connection on 0.0.0.0:5672 from LOADBALANCER_IP:38339

=INFO REPORT==== 8-Sep-2010::09:30:52 ===
starting TCP connection <0.12371.17> from LOADBALANCER_IP:38339

=WARNING REPORT==== 8-Sep-2010::09:30:52 ===
exception on TCP connection <0.12371.17> from LOADBALANCER_IP:38339
connection_closed_abruptly

=INFO REPORT==== 8-Sep-2010::09:30:52 ===
closing TCP connection <0.12371.17> from LOADBALANCER_IP:38339
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are plenty of these in the logs; that's just our load balancer checking 
periodically (once every minute) whether RabbitMQ is alive by opening and 
closing a TCP connection. I've been told this is harmless -> 
http://old.nabble.com/connection-closed-abruptly-errors-on-logs-ts29248096.html#a29248096
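
For reference, the kind of check described here (open a TCP connection, then 
close it without sending anything) can be sketched in Python as below. This is 
an illustration only, not the load balancer's actual implementation, and 
tcp_probe is a hypothetical name:

```python
import socket


def tcp_probe(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port can be opened.

    The connection is closed immediately and nothing is sent on it,
    which mirrors the open-and-close health check described above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable or timed out.
        return False
```

A check like this only shows that the port accepts connections. It says 
nothing about whether the broker can complete an AMQP handshake or answer a 
command such as list_queues.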

But there are more interesting/suspicious entries in both nodes' logs, around 
09:45:38 on node1 and 09:41:38/09:47:08 on node2.

I hope you can help me figure out the root cause of the problem.

Dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rabbitmq-node2.log
Type: application/octet-stream
Size: 29832 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100908/5cc34088/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rabbitmq-node1.log
Type: application/octet-stream
Size: 24374 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100908/5cc34088/attachment-0003.obj>