[rabbitmq-discuss] Understanding a crash report

Simone s.sciarrati at gmail.com
Tue Feb 19 12:02:52 GMT 2013


Hi,

I am relatively new to RabbitMQ and would appreciate help troubleshooting a 
recurring issue on a cluster; apologies for the long email.

I run a 3-node cluster in EC2: 2 disc nodes (A and B) and 1 RAM node (C). 
Exchanges and queues are static and limited in number (fewer than 50). The 
volume of messages can reach a few thousand per second, and queues can 
occasionally grow to a few hundred thousand messages until the consumers 
catch up, but this is well within our memory/disk high watermarks. 
RabbitMQ is v2.8.4 on Ubuntu 12.

The processes that produce and consume messages are Java/Scala and use the 
akka-amqp driver (akka-amqp-1.3.1).

In normal operation our nodes use ~500MB of memory and the high watermark is 
set to ~2.7GB, but on a couple of occasions one of the nodes has very rapidly 
(< 1 min) consumed more than 1GB of memory and triggered the high memory 
watermark (node B, one of the disc nodes, the last time).
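For reference, the watermark comes from the standard vm_memory_high_watermark 
setting; this is roughly what our rabbitmq.config looks like (reconstructed 
from memory, so treat the exact value as approximate):

%% rabbitmq.config (Erlang terms); 0.4 of the instance RAM works out to
%% the ~2.7GB limit mentioned above.
[
  {rabbit, [
    {vm_memory_high_watermark, 0.4}
  ]}
].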

At the moment of the issue most of the memory was held in 'processes' and no 
queue had grown abnormally (in messages or memory). I managed to run 
rabbitmqctl report, found a crash report in the logs (more on this later) and 
tried to see which process(es) were responsible for the memory usage from the 
Erlang shell, but I wasn't able to pin down a specific one (my Erlang is 
limited and unfortunately I didn't dump the process list to a file to look at 
later).
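For what it's worth, this is roughly what I was poking at from a shell 
attached to the node (via erl -remsh); my Erlang is limited, so there may 
well be a better way:

%% Top 10 Erlang processes by memory, with their current function.
lists:sublist(
  lists:reverse(
    lists:keysort(2,
      [{P,
        element(2, erlang:process_info(P, memory)),
        erlang:process_info(P, current_function)}
       || P <- erlang:processes(),
          erlang:process_info(P, memory) =/= undefined])),
  10).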

For every issue we have had (3 in 2 months), the logs show a crash report 
relating to a specific Java process (say process D) sending a message to a 
specific queue (queue D - a queue with no consumers at the moment; if the app 
fails to process a message it ends up here). Note that there are multiple D 
workers running on separate EC2 instances and that queue D sees very little 
traffic, so it is unlikely to be a coincidence.

At this point I was pressured to restart RabbitMQ on the affected node B: 
stop_app, then stop the Erlang VM, then restart. All looked good, the node 
was in a nice place and everything looked green in the management console.

Our Java processes would connect and do some work, but after a while they'd 
throw a connection error (sorry, that is all the detail the logs show). 
Specifically, a process E connected to node A showed this behaviour 
intermittently (a different process and a different RabbitMQ node from the 
two involved in the original issue).
After a while I was forced to restart the whole cluster: stop_app on the RAM 
node C, then on the disc node B, then on the remaining node A. Unfortunately 
A seemed to hang and failed to stop within a reasonable time (minutes), so I 
had to kill it (kill -15). I didn't see any indication in the logs of why it 
was hanging.

When I attempted to start A it failed because the port was already in use, 
and I found that a process still had an open connection to port 5672:

tcp6       0 164945 10.xx.xx.xx:5672        10.xx.xx.xx:53712      FIN_WAIT1

Note: this is a D worker, but running on a different box from the one shown 
in the crash log.

Once this connection was cleared everything came up nice and happy.

In Summary:

Process D was publishing to queue D on node B, something happened, and node B 
consumed more than 1GB of memory in less than a minute. Note that this queue 
sees very little traffic, yet in every issue I have looked into I can see a 
crash report similar to the one below.

After the restart of node B, other processes connected to node A started 
exhibiting connection issues: they would work for a bit, then drop the 
connection, and this continued until the whole cluster was restarted.

I would like to better understand the crash report, and I'd welcome ideas on 
what went wrong and how to troubleshoot such issues more effectively (what 
more info should I collect before restarting the nodes: the Erlang process 
list, Mnesia tables, ETS tables, etc.?).
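Next time, before restarting anything, I plan to at least capture something 
like the following from the affected node (a rough sketch; the file paths 
are just examples):

%% Dump per-process info and ETS table info to files for offline inspection.
{ok, PFd} = file:open("/tmp/rabbit_procs.txt", [write]),
[io:format(PFd, "~p~n", [erlang:process_info(P)]) || P <- erlang:processes()],
file:close(PFd),
{ok, EFd} = file:open("/tmp/rabbit_ets.txt", [write]),
[io:format(EFd, "~p~n", [ets:info(T)]) || T <- ets:all()],
file:close(EFd).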


=CRASH REPORT==== 14-Feb-2013::10:36:54 ===
  crasher:
    initial call: rabbit_reader:init/4
    pid: <0.29283.387>
    registered_name: []
    exception error: bad argument
      in function  port_close/1
         called as port_close(#Port<0.746540>)
      in call from rabbit_net:maybe_fast_close/1
      in call from rabbit_reader:start_connection/7
    ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]
    messages: []
    links: [<0.29280.387>]
    dictionary: [{{channel,10},
                   {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},
                  {{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},
                  {{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},
                  {{channel,3},
                   {<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},
                  {{channel,4},
                   {<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,11},
                   {<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},
                  {{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},
                  {{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},
                  {{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},
                  {{channel,7},
                   {<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,9},
                   {<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},
                  {{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},
                  {{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},
                  {{channel,8},
                   {<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,5},
                   {<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,1},
                   {<0.29299.387>,
                    {content_body,
                        {'basic.publish',0,<<"some_exchange">>,<<>>,false,
                            false},
                        1048189,
                        {content,60,none,
                            <<BYTES IN HERE>>,        --> this showed which process was sending the message
                            rabbit_framing_amqp_0_9_1,
                            [<<MORE BYTES IN HERE>>]  --> this I haven't been able to decode; it is fairly big - is it truncated?
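In case it is useful to anyone reading the report: the only decoding I have 
managed is printing the first binary as a string from a shell, roughly along 
these lines (the binary below is a placeholder, the real one is the <<...>> 
term from the report):

%% Print a copied payload binary as text to see which producer sent it.
Payload = <<"...bytes copied from the crash report...">>,
io:format("~s~n", [Payload]).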


And in the logs we can find the pid <0.29283.387> right before the crash:

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29283.387> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27111.387> (10.xx.xx.xx:50883 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29283.387> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:37:34 ===
vm_memory_high_watermark set. Memory used:3384669192 allowed:2914218803

Looking at the rabbitmqctl report, I have not yet been able to map the memory 
consumption to anything specific.
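Next time I will probably also start by comparing the node's overall memory 
breakdown with the total held by Erlang processes, along these lines (again 
just a sketch from the shell):

%% Overall VM memory breakdown vs. the total held by Erlang processes, to
%% confirm the growth really is in 'processes' rather than binaries/ETS.
Breakdown = erlang:memory(),
ProcTotal = lists:sum(
              [element(2, erlang:process_info(P, memory))
               || P <- erlang:processes(),
                  erlang:process_info(P, memory) =/= undefined]),
{Breakdown, ProcTotal}.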

Any help appreciated,

S
