Hi,

I am relatively new to RabbitMQ and would appreciate help troubleshooting a recurring issue on a cluster; apologies for the long email.

I run a 3-node cluster in EC2: 2 disk nodes (A and B) and 1 RAM node (C). Exchanges and queues are static and limited in number (fewer than 50). Message volume can reach a few thousand per second, and queues can occasionally grow to a few hundred thousand messages until the consumers catch up, but this is well within our memory/disk high watermarks. RabbitMQ is v2.8.4 on Ubuntu 12.
The processes that produce and consume messages are Java/Scala and use the akka-amqp library (akka-amqp-1.3.1).

In normal operation our nodes use ~500MB of memory and the high watermark is set to 2.7GB, but on a couple of occasions one of the nodes managed to very rapidly (< 1 min) eat > 1GB of memory and trigger the high memory watermark (node B, one of the disk nodes, the last time).
At the time of the issue most of the memory was held in 'processes' and no queue was oversized (in messages or memory). I managed to run rabbitmqctl report, found a crash report in the logs (more on this later), and tried to work out from the Erlang shell which process(es) were responsible for the memory usage, but I wasn't able to pin down a specific one (my Erlang is limited and unfortunately I didn't dump the process list to a file to look at later).
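For next time, this is what I have put together to list the top processes by memory from an Erlang shell attached to the node (a rough sketch of my own, happy to be told there is a better way; the 20 is arbitrary):

    %% Top 20 processes by memory, with a bit of identifying info.
    %% Processes that die mid-scan are simply skipped by the generator.
    lists:sublist(
      lists:reverse(lists:sort(
        [{Mem, Pid, erlang:process_info(Pid, [registered_name,
                                              current_function,
                                              message_queue_len])}
         || Pid <- erlang:processes(),
            {memory, Mem} <- [erlang:process_info(Pid, memory)]])),
      20).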
For every issue we have had (3 in 2 months), the logs show a crash report involving a specific Java process (call it process D) sending a message to a specific queue (queue D; this queue has no consumers at the moment, and if the app fails to process a message it ends up here). Note that there are multiple D workers running on separate EC2 instances and that queue D is very low traffic, so it is unlikely to be a coincidence.
At this point I was pressured to restart RabbitMQ on the affected node B: stop_app, then stop the Erlang VM, then restart. All looked good, the node was in a nice place and everything was green in the management console.
Our Java processes would connect and do some work, but after a while they would throw a connection error (sorry, that is all the detail the logs show). Specifically, a process E connected to node A showed this erratic behaviour (a different process and a different RabbitMQ node from the two involved in the original issue).
After a while I was forced to restart the whole cluster: stop_app on the RAM node C, then on the disk node B, then on the other disk node A. Unfortunately A seemed to hang and failed to stop within a reasonable time (minutes), and I had to kill it (kill -15). I didn't see any indication in the logs of why it was hanging.
When I attempted to start A it failed because the port was already in use, and I found that a process still had an open connection to port 5672:

tcp6 0 164945 10.xx.xx.xx:5672 10.xx.xx.xx:53712 FIN_WAIT1
Note: this is a D worker, but running on a different box from the one shown in the crash log.

Once that connection was cleared, everything came up nice and happy.
In summary:

Process D was publishing to queue D on node B, something happened, and node B consumed > 1GB of memory in less than a minute. This queue is really low traffic, yet in every occurrence I have looked into I can see a crash report similar to the one below.

After the restart of node B, other processes connected to node A started exhibiting connection issues: they would work for a bit and then drop the connection, until the whole cluster was restarted.
I would like to better understand the crash report, and perhaps get some ideas on what went wrong and how to troubleshoot these issues more effectively (what extra information should I collect before restarting the nodes: the Erlang process list, mnesia tables, ets tables, etc.).
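Concretely, this is what I am thinking of dumping to a file before the next restart, from a shell attached to the node (just a sketch, the file name is arbitrary; I understand ets memory is reported in words rather than bytes):

    %% Dump memory breakdown, ets tables, processes and mnesia info to a
    %% file before restarting, so it can be inspected afterwards.
    {ok, Dev} = file:open("/tmp/rabbit_debug.txt", [write]),
    io:format(Dev, "memory: ~p~n", [erlang:memory()]),
    io:format(Dev, "ets: ~p~n",
              [[{T, ets:info(T, name), ets:info(T, size), ets:info(T, memory)}
                || T <- ets:all()]]),
    io:format(Dev, "processes: ~p~n",
              [[{P, erlang:process_info(P, [registered_name, memory,
                                            message_queue_len, current_function])}
                || P <- erlang:processes()]]),
    io:format(Dev, "mnesia: ~p~n", [mnesia:system_info(all)]),
    file:close(Dev).

Is that the right sort of thing to capture, or is there something more useful?

Here is the crash report: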
=CRASH REPORT==== 14-Feb-2013::10:36:54 ===
  crasher:
    initial call: rabbit_reader:init/4
    pid: <0.29283.387>
    registered_name: []
    exception error: bad argument
      in function  port_close/1
        called as port_close(#Port<0.746540>)
      in call from rabbit_net:maybe_fast_close/1
      in call from rabbit_reader:start_connection/7
    ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]
    messages: []
    links: [<0.29280.387>]
    dictionary: [{{channel,10},
                  {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},
                 {{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},
                 {{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},
                 {{channel,3},
                  {<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},
                 {{channel,4},
                  {<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{channel,11},
                  {<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},
                 {{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},
                 {{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},
                 {{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},
                 {{channel,7},
                  {<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{channel,9},
                  {<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},
                 {{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},
                 {{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},
                 {{channel,8},
                  {<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{channel,5},
                  {<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},
                 {{channel,1},
                  {<0.29299.387>,
                   {content_body,
                    {'basic.publish',0,<<"some_exchange">>,<<>>,false,
                     false},
                    1048189,
                    {content,60,none,
                     <<BYTES IN HERE>>,       --> this showed which process was sending the message
                     rabbit_framing_amqp_0_9_1,
                     [<<MORE BYTES IN HERE>>] --> This I haven't been able to decode, it is fairly big, is it truncated?
In the logs we can find the same pid, 0.29283.387, shortly before the crash:

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29283.387> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)
=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27111.387> (10.xx.xx.xx:50883 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29283.387> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:37:34 ===
vm_memory_high_watermark set. Memory used:3384669192 allowed:2914218803

Looking at the rabbitmqctl report I have not been able to map the memory consumption to anything specific yet.
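Next time I will try to break the 'processes' memory down by process type from the shell, along these lines (again a rough sketch of my own, probably naive):

    %% Sum process memory grouped by the '$initial_call' entry in each
    %% process dictionary, to see whether connection readers, channels or
    %% queue processes are holding the memory. Sorted biggest first.
    lists:reverse(lists:keysort(2,
      dict:to_list(lists:foldl(
        fun(Pid, Acc) ->
            case erlang:process_info(Pid, [memory, dictionary]) of
                [{memory, M}, {dictionary, D}] ->
                    dict:update_counter(
                      proplists:get_value('$initial_call', D, unknown), M, Acc);
                undefined ->
                    Acc
            end
        end, dict:new(), erlang:processes())))).

Would something like that (or the management plugin) be the right way to attribute the 'processes' memory?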
Any help appreciated,

S