[rabbitmq-discuss] Crash with RabbitMQ 3.1.5

Wed Oct 16 15:14:12 BST 2013

Hi there,

Hoping someone can help me out.  We recently experienced 2 crashes with our
RabbitMQ cluster.  After the first crash, we moved the Mnesia directories
elsewhere and started RabbitMQ again.  This got us up and running.  The
second time it happened, we had the original nodes plus an additional 5
nodes we had added to the cluster, which we were planning to leave in place
while shutting the old nodes down.

During the crash, symptoms were as follows:

- Escalating (and sudden) CPU utilisation on some (but not all) nodes
- Escalating memory usage (not necessarily aligned to the spiking CPU)
- Increasing time to publish on queues (and specifically on a test queue we
have setup that exists only to test publishing and consuming from the
cluster hosts)
- Running `rabbitmqctl cluster_status` gets increasingly slow (some nodes
eventually taking up to 10m to return with the response data - some were
fast and took 5s)
- Management plugin stops responding, or responds so slowly that it no
longer loads any data at all (probably the same thing that causes the
preceding item)
- Can't force nodes to forget other nodes (calling `rabbitmqctl
forget_cluster_node` doesn't return)

- When trying to shut down a node, running `rabbitmqctl stop_app` appears
to block on epmd and doesn't return
--- When that doesn't return we eventually have to ctrl-c the command
--- We have to issue a kill signal to rabbit to stop it
--- We do the same to the epmd process
--- However, the other nodes all still think that the killed node is active
(based on `rabbitmqctl cluster_status` -- nodes that were slow to run this
and nodes that were fast both saw the same view of the cluster, which
included the dead node)
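
To put numbers on the "slow vs fast" observation above, a minimal sketch of
timing a command with a hard timeout follows. The helper name `time_command`
and the 600 second default are assumptions made for illustration; they are
not part of RabbitMQ.

```python
import subprocess
import time


def time_command(argv, timeout_s=600.0):
    """Run argv and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within timeout_s,
    which is how a hung `rabbitmqctl` call would show up here.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        code = result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        code = None
    return time.monotonic() - start, code


# Hypothetical usage against one node of the cluster described in this post:
#   elapsed, code = time_command(
#       ["rabbitmqctl", "-n", "rabbit@b05.internal", "cluster_status"])
#   print(f"{elapsed:.1f}s, exit code {code}")
```

Running this per node at intervals would show whether the slowdown is
gradual or sudden, and whether it tracks the CPU or the memory escalation.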

Config / details as follows (we use mirrored queues -- 5 hosts, all disc
nodes, with a global policy that all queues are mirrored "ha-mode:all"),
running on Linux

[
        {rabbit, [
                {cluster_nodes, {['rabbit@b05.internal', 'rabbit@b06.internal',
                                  'rabbit@b07.internal', 'rabbit@b08.internal',
                                  'rabbit@b09.internal'], disc}},
                {cluster_partition_handling, pause_minority}
        ]}
].

And the env:

NODENAME="rabbit@b09.internal"
SERVER_ERL_ARGS="-kernel inet_dist_listen_min 27248 -kernel inet_dist_listen_max 27248"

The erl_crash.dump slogan was: eheap_alloc: Cannot allocate
2850821240 bytes of memory (of type "old_heap").

System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2]
[rq:2] [async-threads:0] [kernel-poll:false]

Compiled Fri Dec 16 03:22:15 2011
Taints (none)
Memory allocated 6821368760 bytes
Atoms 22440
Processes 4899
ETS tables 80
Timers 23
Funs 3994

When I look at the Process Information, it seems there's a small number with
a lot of messages queued, and the rest are an order of magnitude lower:

Pid        Name/Spawned as    State      Reductions  Stack+heap  MsgQ Length
<0.400.0>  proc_lib:init_p/5  Scheduled   146860259    59786060        37148
<0.373.0>  proc_lib:init_p/5  Scheduled   734287949     1346269        23360
<0.366.0>  proc_lib:init_p/5  Waiting     114695635     5135590        19744
<0.444.0>  proc_lib:init_p/5  Waiting     154538610      832040         3326
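
A small sketch of ranking these rows by mailbox length, in case it helps
anyone doing the same with a longer process list. The four rows are copied
from the table above; the `ProcRow` type and `parse_rows` helper are
hypothetical names, and the parser assumes each row has exactly six
whitespace-separated fields.

```python
from typing import NamedTuple


class ProcRow(NamedTuple):
    pid: str
    spawned_as: str
    state: str
    reductions: int
    stack_heap: int
    msgq_len: int


# Rows copied verbatim from the Process Information table in this post.
TABLE = """\
<0.400.0> proc_lib:init_p/5 Scheduled 146860259 59786060 37148
<0.373.0> proc_lib:init_p/5 Scheduled 734287949 1346269 23360
<0.366.0> proc_lib:init_p/5 Waiting 114695635 5135590 19744
<0.444.0> proc_lib:init_p/5 Waiting 154538610 832040 3326
"""


def parse_rows(text):
    """Parse whitespace-separated rows; skip lines without six fields."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        pid, spawned, state, red, heap, msgq = fields
        rows.append(ProcRow(pid, spawned, state, int(red), int(heap), int(msgq)))
    return rows


rows = sorted(parse_rows(TABLE), key=lambda r: r.msgq_len, reverse=True)
total = sum(r.msgq_len for r in rows)

# Print each process with its share of the queued messages listed here.
for r in rows:
    print(f"{r.pid:<10} {r.msgq_len:>6} {r.msgq_len / total:6.1%}")
```

The three busiest processes hold 80252 of the 83578 messages queued across
these four rows.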

When I view the second process (the first one crashes Erlang on me), I see a
large number of sender_death events (not sure if these are common or highly
unusual?)

{'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}

mixed in with other more regular events:

{'$gen_cast',
    {gm,{publish,<2708.20321.59>,
            {message_properties,undefined,false},
            {basic_message,
<.. snip..>
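
To answer the "common or unusual" question with numbers, the `gm` message
types in a dumped mailbox can be tallied. The sketch below assumes the
mailbox is available as text in the form shown above; the regular expression
and the `count_gm_tags` name are illustrative assumptions, and the sample
contains only the two fragments quoted in this post.

```python
import re
from collections import Counter

# Matches the tag that follows "{gm,{" in terms such as
# {'$gen_cast',{gm,{sender_death,<...>}}}
GM_TAG = re.compile(r"\{gm,\s*\{(\w+),")


def count_gm_tags(mailbox_text):
    """Count gm message tags (e.g. sender_death, publish) in dumped text."""
    return Counter(GM_TAG.findall(mailbox_text))


# Only the two fragments quoted in this post; a real mailbox would be far longer.
SAMPLE = """\
{'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
{'$gen_cast',
    {gm,{publish,<2708.20321.59>,
"""

counts = count_gm_tags(SAMPLE)
for tag, n in counts.most_common():
    print(f"{tag}: {n}")
```

Comparing the sender_death count against the publish count for the whole
mailbox would show whether those events dominate the backlog.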