Hi,<div><br></div><div>I am relatively new to RabbitMQ and would appreciate help troubleshooting a recurring issue on a cluster; apologies for the long email.</div><div><br></div><div>I run a 3-node cluster in EC2: two disc nodes (A and B) and one RAM node (C). Exchanges and queues are static and limited in number (fewer than 50). Message volume can reach a few thousand per second, and queues can occasionally grow to a few hundred thousand messages until the consumers catch up, but this is well within our memory/disk high watermarks. RabbitMQ is v2.8.4 on Ubuntu 12.</div>
<div><br></div><div>The processes that produce and consume messages are Java/Scala and use the akka-amqp client library (akka-amqp-1.3.1).</div><div><br></div><div>In normal operation our nodes run at ~500MB of memory and the limit is set to 2.7GB, but on a couple of occasions one of the nodes managed very rapidly (< 1 min) to eat > 1GB of memory and triggered the high memory watermark (node B, one of the disc nodes, last time).</div>
<div><br></div><div>At the moment of the issue most of the memory was held in 'processes', and no queue was overgrown (in messages or memory). I managed to run a rabbitmqctl report, found a crash report in the logs (more on this later), and tried to identify which process(es) were responsible for the memory usage from the Erlang shell, but I wasn't able to pin down a specific one (my Erlang is limited, and unfortunately I didn't dump the process list to a file to look at later).</div>
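<div><br></div><div>For next time, something like the following would let me capture the top memory consumers before restarting. This is a sketch only, run via rabbitmqctl eval or an attached Erlang shell; I haven't verified the exact expression, and a naive version like this can error if a process dies mid-listing:</div>

```erlang
%% Sketch: list the 10 biggest Erlang processes by memory, so the
%% result can be dumped to a file before restarting the node.
%% Assumes every pid is still alive when process_info/2 runs.
lists:sublist(
  lists:reverse(lists:keysort(2,
    [{P, element(2, erlang:process_info(P, memory))}
     || P <- erlang:processes()])),
  10).
```

<div>Pairing each pid with erlang:process_info(P, [registered_name, current_function]) in the same comprehension would make the dump easier to interpret afterwards.</div>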
<div><br></div><div>For every issue we have had (3 in 2 months), the logs show a crash report involving a specific Java process (say process D) sending a message to a specific queue (queue D; this queue has no consumers at the moment, and if the app fails to process a message it ends up here). Note that there are multiple D workers running on separate instances in EC2, and that queue D is very low traffic, so it is unlikely to be a coincidence.</div>
<div><br></div><div>At this point I was pressured to restart RabbitMQ on the affected node B: a stop_app, then stopping the Erlang VM, then a restart. All looked good, the node was in a nice place, and everything looked green in the management console.</div>
<div><br></div><div>Our Java processes would connect and do some work, but after a while they'd throw a connection error (sorry, that is all the detail the logs would show). Specifically, a process E connected to node A would show this random behaviour (a different process and a different RabbitMQ node from the two involved in the original issue).</div>
<div>After a while I was forced to restart the whole cluster: stopping the app on the RAM node C, then on the disc node B, then on the other node A. Unfortunately A seemed to hang and failed to stop within a reasonable time (minutes), and I had to kill it (kill -15). I didn't see any indication in the logs of why it was hanging.</div>
<div><br></div><div>When I attempted to start A it would fail because the port was already in use, and I found that a process still had an open connection to port 5672:</div><div><br></div><div>tcp6       0 164945 10.xx.xx.xx:5672        10.xx.xx.xx:53712      FIN_WAIT1</div>
<div><br></div><div>Note: this is a worker D, but running on a different box from the one shown in the crash log.</div><div><br></div><div>Once this connection was cleared, everything came up nice and happy.</div><div><br>
</div><div>In summary:</div><div><br></div><div>Process D was producing on queue D on node B; something happened, and node B managed to consume > 1GB of memory in less than a minute. Note that this queue is really low traffic, but in every issue I have looked into I can see a crash report similar to the one below.</div>
<div><br></div><div>After a restart of node B, other processes connected to node A started exhibiting connection issues: they would work for a bit, then break the connection, until the whole cluster was restarted.</div><div><br></div><div>
I would like to better understand the crash report, and perhaps get some ideas on what went wrong and how to troubleshoot such issues more effectively (what more info should I collect before restarting the nodes: the Erlang process list, Mnesia tables, ETS tables, etc.?).</div>
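<div><br></div><div>To make the crash report below a little less opaque to myself, I sketched a small parser for its dictionary section. The regexes and the summarize_crash name are my own, not part of any RabbitMQ tooling, and the sample string is an abridged copy of the report:</div>

```python
import re

def summarize_crash(report_text):
    """Summarize a rabbit_reader crash report from its process dictionary."""
    # Each open channel shows up as a {{channel,N}, ...} entry.
    channels = set(re.findall(r"\{\{channel,(\d+)\}", report_text))
    # The in-flight publish carries its exchange name as a <<"...">> binary.
    m = re.search(r"'basic\.publish',\d+,<<\"([^\"]+)\">>", report_text)
    return {"channels": len(channels), "exchange": m.group(1) if m else None}

# Abridged excerpt of the crash report below (two channels shown).
sample = """
    dictionary: [{{channel,10},
                  {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{channel,1},
                  {<0.29299.387>,
                   {content_body,
                       {'basic.publish',0,<<"some_exchange">>,<<>>,false,
"""
print(summarize_crash(sample))
```

<div>Run against the full report text, this at least confirms how many channels the dying connection held and which exchange the pending publish targeted.</div>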
<div><br></div><div><br></div><div><div>=CRASH REPORT==== 14-Feb-2013::10:36:54 ===</div><div>  crasher:</div><div>    initial call: rabbit_reader:init/4</div><div><span>    pid: <</span><b>0.29283.387</b><span>></span></div><div>    registered_name: []</div><div>    exception error: bad argument</div><div>      in function  port_close/1</div><div>         called as port_close(#Port<0.746540>)</div><div>      in call from rabbit_net:maybe_fast_close/1</div><div>      in call from rabbit_reader:start_connection/7</div><div>    ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]</div><div>    messages: []</div><div>    links: [<0.29280.387>]</div><div>    dictionary: [{{channel,10},</div><div>                  {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},</div><div>                 {{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},</div><div>                 {{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},</div><div>                 {{channel,3},</div><div>                  {<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},</div><div>                 {{channel,4},</div><div>                  {<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{channel,11},</div><div>                  {<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},</div><div>                 {{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},</div><div>                 {{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},</div><div>                 {{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},</div><div>                 {{channel,7},</div><div>                  {<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{channel,9},</div><div>                  {<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},</div><div>                 {{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},</div><div>                 {{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},</div><div>                 {{channel,8},</div><div>                  {<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{channel,5},</div><div>                  {<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},</div><div>                 {{channel,1},</div><div>                  {<0.29299.387>,</div><div>                   {content_body,</div><div>                       {'basic.publish',0,<<"some_exchange">>,<<>>,false,</div><div>                           false},</div><div>                       1048189,</div><div>                       {content,60,none,</div><div>                           <<BYTES IN HERE>>,   --> this showed which process was sending the message</div><div>                           rabbit_framing_amqp_0_9_1,</div><div>                           [<<MORE BYTES IN HERE>>]  --> this I haven't been able to decode; it is fairly big, is it truncated?</div>
</div><div><br></div><div><br></div><div>And in the logs we can find the pid <b>0.29283.387</b> right before the crash:</div><div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:31:46 ===</div><div><span>accepting AMQP connection <<b>0.29283.387</b>> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)</span></div>
<div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:31:46 ===</div><div><span>accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)</span></div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:32:27 ===</div><div><span>closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:32:27 ===</div><div><span>closing AMQP connection <0.27111.387> (10.xx.xx.xx:50883 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:33:35 ===</div><div><span>accepting AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672)</span></div><div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:33:35 ===</div><div><span>accepting AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672)</span></div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:37:34 ===</div><div><span>closing AMQP connection <<b>0.29283.387</b>> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:37:34 ===</div><div><span>closing AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:37:34 ===</div><div><span>closing AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=WARNING REPORT==== 14-Feb-2013::10:37:34 ===</div><div><span>closing AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672):</span></div><div>connection_closed_abruptly</div><div><br></div><div>=INFO REPORT==== 14-Feb-2013::10:37:34 ===</div>
<div>vm_memory_high_watermark set. Memory used:3384669192 allowed:2914218803</div><div><br></div><div>Looking at the rabbitmqctl report, I have not yet been able to map the memory consumption to anything specific.<br></div>
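<div><br></div><div>In case it helps anyone else reading the logs above: pairing up the accepting/closing lines by connection pid is easy to script. A minimal sketch (pair_connections is my own name, and the addresses in the sample are made up stand-ins for the redacted ones; the line format is as RabbitMQ 2.x prints it):</div>

```python
import re

ACCEPT = re.compile(r"accepting AMQP connection <([\d.]+)> \((\S+) -> \S+\)")
CLOSE = re.compile(r"closing AMQP connection <([\d.]+)>")

def pair_connections(log_text):
    """Map each connection pid to its peer address and whether it closed."""
    conns = {}
    for pid, peer in ACCEPT.findall(log_text):
        conns[pid] = {"peer": peer, "closed": False}
    for pid in CLOSE.findall(log_text):
        if pid in conns:
            conns[pid]["closed"] = True
    return conns

# Made-up addresses standing in for the redacted ones in the logs.
sample = """
accepting AMQP connection <0.29283.387> (10.0.0.1:58622 -> 10.0.0.2:5672)
accepting AMQP connection <0.29287.387> (10.0.0.1:58623 -> 10.0.0.2:5672)
closing AMQP connection <0.29283.387> (10.0.0.1:58622 -> 10.0.0.2:5672):
"""
print(pair_connections(sample))
```

<div>Connections that show up in an accepting line but never in a closing one would be candidates for the kind of lingering FIN_WAIT1 socket that blocked the restart of node A.</div>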
<div><br></div><div>Any help appreciated,</div><div><br></div><div>S</div>