[rabbitmq-discuss] Help understanding a crash report

Simone Sciarrati s.sciarrati at gmail.com
Thu Feb 21 20:41:41 GMT 2013


Hi,

I am relatively new to RabbitMQ and would appreciate some help
troubleshooting a recurring issue on a cluster; apologies for the long
email.

I run a 3-instance cluster in EC2: 2 disk nodes (A and B) and 1 RAM node
(C). Exchanges and queues are static and limited in number (fewer than 50).
The volume of messages can reach a few thousand per second, and the queues
can occasionally grow to a few hundred thousand messages until the
consumers manage to catch up, but this stays well within our memory/disk
high watermarks. RabbitMQ is v2.8.4 on Ubuntu 12.
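
For context, the memory limit comes from rabbitmq.config; the snippet below
is only an approximation from memory (the exact fraction may well differ),
included to show how it is set:

%% /etc/rabbitmq/rabbitmq.config (approximate; exact fraction is from memory)
[
  {rabbit, [
    %% vm_memory_high_watermark is a fraction of installed RAM; on these
    %% instances it works out to roughly the 2.7GB limit mentioned below.
    {vm_memory_high_watermark, 0.4}
  ]}
].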

The processes that produce and consume messages are Java/Scala and use the
akka-amqp library (akka-amqp-1.3.1).

In normal operation our nodes run on ~500MB of memory and the limit is set
to 2.7GB, but on a couple of occasions one of the nodes managed to very
rapidly (< 1 min) eat > 1GB of memory and trigger the high memory watermark
(node B, one of the disk nodes, last time).

At the time of the issue most of the memory was held in 'processes' and no
queue was oversized (in messages or memory). I managed to run rabbitmqctl
report, found a crash report in the logs (more on this later), and tried to
see from the Erlang shell which process(es) were responsible for the memory
usage, but I wasn't able to pin down a specific one (my Erlang is limited,
and unfortunately I didn't dump the process list to a file to look at later).
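
For reference, this is roughly the kind of thing I was attempting from a
shell attached to the node; it is only a sketch of the approach, not what I
actually captured at the time: the ten processes using the most memory,
along with their current function.

%% Sketch only: top ten processes by memory, with their current function,
%% run from an Erlang shell attached to the node.
TopTen = lists:sublist(
           lists:reverse(lists:keysort(2,
             [{Pid, Mem, erlang:process_info(Pid, current_function)}
              || Pid <- erlang:processes(),
                 {memory, Mem} <- [erlang:process_info(Pid, memory)]])),
           10).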

For every issue we have had (3 in 2 months), the logs show a crash report
related to a specific Java process (say process D) sending a message to a
specific queue (queue D - this queue has no consumers at the moment; if the
app fails to process a message it ends up here). Note that there are
multiple D workers running on separate EC2 instances and that queue D is
very low traffic, so it is unlikely to be a coincidence.

At this point I was pressured to restart RabbitMQ on the affected node B: a
stop_app, then a stop of the Erlang VM, then a restart. All looked good
afterwards; the node was in a nice place and everything was green in the
management console.

Our Java processes would connect and do some work, but after a while they'd
throw a connection error (sorry, that is all the detail the logs show).
Specifically, a process E connected to node A showed this intermittent
behaviour (a different process and a different RabbitMQ node from the two
involved in the original issue).
After a while I was forced to restart the whole cluster: stop_app on the
RAM node C, then on the disk node B, then on the remaining node A.
Unfortunately A seemed to hang and failed to stop within a reasonable time
(minutes), and I had to kill it (kill -15). I didn't see any indication in
the logs of why it was hanging.

When I attempted to start A it would fail because the port was already in
use, and I found that a client still had an open connection to port 5672:

tcp6       0 164945 10.xx.xx.xx:5672        10.xx.xx.xx:53712      FIN_WAIT1

Note: this is a D worker, but running on a different box from the one shown
in the crash log.

Once this connection was cleared, everything came up nice and happy.

In summary:

Process D was publishing to queue D on node B, something happened, and node
B managed to consume > 1GB of memory in less than a minute. Note that this
queue is really low traffic, but in every issue I have looked into I can
see a crash report similar to the one below.

After a restart of node B, other processes connected to node A started
exhibiting connection issues: they would work for a bit and then break the
connection, until the whole cluster was restarted.

I would like to better understand the crash report, and perhaps get some
ideas on what went wrong and how to troubleshoot such issues more
effectively (what more info should I collect before restarting the nodes:
the Erlang process list, Mnesia tables, ETS tables, etc.).
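
For next time, this is roughly the kind of snapshot I am thinking of taking
from an attached Erlang shell before restarting anything; the file names
and the chosen items are just my guess at what would be useful:

%% Sketch of a pre-restart capture; file paths and items are my assumptions.
ok = file:write_file("/tmp/rmq_processes.dump",
       io_lib:format("~p.~n",
         [[{P, erlang:process_info(P, [registered_name, memory,
                                       message_queue_len, current_function])}
           || P <- erlang:processes()]])),
ok = file:write_file("/tmp/rmq_ets.dump",
       io_lib:format("~p.~n",
         [[{T, ets:info(T, size), ets:info(T, memory)} || T <- ets:all()]])),
ok = file:write_file("/tmp/rmq_mnesia.dump",
       io_lib:format("~p.~n", [mnesia:system_info(all)])).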


=CRASH REPORT==== 14-Feb-2013::10:36:54 ===
  crasher:
    initial call: rabbit_reader:init/4
    pid: <*0.29283.387*>
    registered_name: []
    exception error: bad argument
      in function  port_close/1
         called as port_close(#Port<0.746540>)
      in call from rabbit_net:maybe_fast_close/1
      in call from rabbit_reader:start_connection/7
    ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]
    messages: []
    links: [<0.29280.387>]
    dictionary: [{{channel,10},
                   {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},
                  {{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},
                  {{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},
                  {{channel,3},
                   {<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},
                  {{channel,4},
                   {<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,11},
                   {<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},
                  {{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},
                  {{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},
                  {{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},
                  {{channel,7},
                   {<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,9},
                   {<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},
                  {{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},
                  {{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},
                  {{channel,8},
                   {<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,5},
                   {<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{channel,1},
                   {<0.29299.387>,
                    {content_body,
                        {'basic.publish',0,<<"some_exchange">>,<<>>,false,
                            false},
                        1048189,
                        {content,60,none,
                            <<BYTES IN HERE>>,   --> this showed which process was sending the message
                            rabbit_framing_amqp_0_9_1,
                            [<<MORE BYTES IN HERE>>]  --> this I haven't been able to decode, it is fairly big, is it truncated?


And in the logs we can find the pid *0.29283.387* right before the crash:

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <*0.29283.387*> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27111.387> (10.xx.xx.xx:50883 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <*0.29283.387*> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:37:34 ===
vm_memory_high_watermark set. Memory used:3384669192 allowed:2914218803

Looking at the rabbitmqctl report, I have not yet been able to map the
memory consumption to anything specific.
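
The only other breakdown I have looked at from the shell is the coarse
per-category one, which as far as I can tell is what the memory section of
the report is based on:

%% Coarse VM-level breakdown: total, processes, ets, binary, code, atom, ...
erlang:memory().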

Any help appreciated,

S

