[rabbitmq-discuss] Crash with RabbitMQ 3.1.5

Wed Oct 16 16:29:01 BST 2013

On 17 October 2013 01:29, Tim Watson <tim at rabbitmq.com> wrote:

> Hello David!
>
>
Hey Tim, thanks for replying so quickly!

> On 16 Oct 2013, at 15:14, David Harrison wrote:
> > Hoping someone can help me out.  We recently experienced 2 crashes with
> > our RabbitMQ cluster.  After the first crash, we moved the Mnesia
> > directories elsewhere, and started RabbitMQ again.  This got us up and
> > running.  Second time it happened, we had the original nodes plus an
> > additional 5 nodes we had added to the cluster that we were planning to
> > leave in place while shutting the old nodes down.
> >
>
> What version of rabbit are you running, and how was it installed?
>

3.1.5, running on Ubuntu Precise, installed via deb package.

>
> > During the crash symptoms were as follows:
> >
> > - Escalating (and sudden) CPU utilisation on some (but not all) nodes
>
> We've fixed at least one bug with that symptom in recent releases.
>

I think 3.1.5 is the latest stable release?

>
> > - Increasing time to publish on queues (and specifically on a test queue
> > we have setup that exists only to test publishing and consuming from the
> > cluster hosts)
>
> Are there multiple publishers on the same connection/channel when this
> happens? It wouldn't be unusual, if the server was struggling, to see flow
> control kick in and affect publishers in this fashion.
>

Yes, in some cases there would be. For our test queue there wouldn't be, but
we still saw publish times of up to 10s on it.

>
> > - Running `rabbitmqctl cluster_status` gets increasingly slow (some
> > nodes eventually taking up to 10m to return with the response data - some
> > were fast and took 5s)
>
> Wow, 10m is amazingly slow. Can you provide log files for this period of
> activity and problems?
>

I'll take a look. We saw a few "too many processes" messages, and
"Generic server net_kernel terminating" followed by:

** Reason for termination ==
** {system_limit,[{erlang,spawn_opt,
[inet_tcp_dist,do_setup,
[<0.19.0>,'rabbit@b02.internal',normal,
'rabbit@b00.internal',longnames,7000],
[link,{priority,max}]]},
{net_kernel,setup,4},
{net_kernel,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}

=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
** gen_event handler rabbit_error_logger crashed.
** Was installed in error_logger
** Last event was: {error,<0.8.0>,{emulator,"~s~n",["Too many processes\n"]}}
** When handler state == {resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}
** Reason == {aborted,
                 {no_exists,
                     [rabbit_topic_trie_edge,
                      {trie_edge,
                          {resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>},
                          root,"error"}]}}

=ERROR REPORT==== 15-Oct-2013::16:07:10 ===
Mnesia(nonode@nohost): ** ERROR ** mnesia_controller got unexpected
info: {'EXIT',
<0.97.0>,
shutdown}

=ERROR REPORT==== 15-Oct-2013::16:11:38 ===
Mnesia('rabbit@b00.internal'): ** ERROR ** mnesia_event got
{inconsistent_database, starting_partitioned_network,
'rabbit@b01.internal'}

> > - When trying to shut down a node, running `rabbitmqctl stop_app`
> > appears to block on epmd and doesn't return
>
> Again, we've fixed bugs in that area in recent releases.
>
> > --- When that doesn't return we eventually have to ctrl-c the command
> > --- We have to issue a kill signal to rabbit to stop it
> > --- Do the same to the epmd process
>
> Even if you have to `kill -9' a rabbit node, you shouldn't need to kill
> epmd. In theory at least. If that was necessary to fix the "state of the
> world", it would be indicative of a problem related to the erlang
> distribution mechanism, but I very much doubt that's the case here.
>
> > Config / details as follows (we use mirrored queues -- 5 hosts, all disc
> > nodes, with a global policy that all queues are mirrored "ha-mode:all"),
> > running on Linux
> >
>
> How many queues are we talking about here?
>

~30

>
> > [
> >         {rabbit, [
> >                 {cluster_nodes, {['rabbit@b05.internal',
> >                     'rabbit@b06.internal','rabbit@b07.internal',
> >                     'rabbit@b08.internal','rabbit@b09.internal'], disc}},
> >                 {cluster_partition_handling, pause_minority}
>
> Are you sure that what you're seeing is not caused by a network partition?
> If it were, any nodes in a minority island would "pause", which would
> certainly lead to the kind of symptoms you've mentioned here, viz
> rabbitmqctl calls not returning and so on.
>

There was definitely a network partition, but the whole cluster nosedived
during the crash.
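
For reference, the pause_minority rule Tim describes can be sketched as
follows. This is an illustration only, not RabbitMQ's code: the function name
and the node counts are made up for the example, and it assumes the documented
rule that a node pauses when it can see half or fewer of the cluster's nodes
(itself included).

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Return True if a node in pause_minority mode would pause itself.

    visible_nodes counts the node itself plus every peer it can still reach.
    Assumption: a partition holding exactly half of the cluster also counts
    as a minority, so an even split pauses both sides.
    """
    if not 1 <= visible_nodes <= cluster_size:
        raise ValueError("visible_nodes must be between 1 and cluster_size")
    return 2 * visible_nodes <= cluster_size


# A 5-node cluster split 3/2: the side of 2 pauses, the side of 3 keeps running.
for visible in (3, 2):
    print(visible, should_pause(visible, 5))
```

Under this rule an even split of a cluster with an even number of nodes would
pause every node, which is one way a partition can take a whole cluster
offline.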

>
> > The erl_crash.dump slogan error was : eheap_alloc: Cannot allocate
> > 2850821240 bytes of memory (of type "old_heap").
> >
>
> That's a plain old OOM failure. Rabbit ought to start deliberately paging
> messages to disk well before that happens, which might also explain a lot
> of the slow/unresponsive-ness.
>

These hosts aren't running swap. We give them a fair bit of RAM (and have
given them even more now as a possible stopgap).
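
To put the slogan in perspective, the failed allocation was a single request
of roughly 2.66 GiB. A quick sketch of the arithmetic (the helper name is
hypothetical; the slogan string is the one quoted above):

```python
import re

SLOGAN = 'eheap_alloc: Cannot allocate 2850821240 bytes of memory (of type "old_heap").'


def failed_allocation_gib(slogan: str) -> float:
    """Extract the byte count from a crash-dump slogan and convert it to GiB."""
    match = re.search(r"Cannot allocate (\d+) bytes", slogan)
    if match is None:
        raise ValueError("slogan does not contain an allocation size")
    return int(match.group(1)) / 2**30


print(f"{failed_allocation_gib(SLOGAN):.2f} GiB")
```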

>
> > System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2]
> > [rq:2] [async-threads:0] [kernel-poll:false]
> >
>
> I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty ancient
> and a *lot* of bug fixes have appeared in erts + OTP since then.
>
>
OK, good advice, we'll do that.

> > When I look at the Process Information it seems there's a small number
> > with a lot of messages queued, and the rest are an order of magnitude lower:
> >
>
> That's not unusual.
>
> > when I view the second process (first one crashes erlang on me), I see a
> > large number of sender_death events (not sure if these are common or highly
> > unusual ?)
> >
> > {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
> >
>
> Interesting - will take a look at that. If you could provide logs for the
> participating nodes during this whole time period, that would help a lot.
>
> > mixed in with other more regular events:
> >
>
> Actually, sender_death messages are not "irregular" as such. They're just
> notifying the GM group members that another member (on another node) has
> died. This is quite normal with mirrored queues, when nodes get partitioned
> or stopped due to cluster recovery modes.
>
> Cheers,
> Tim
>
>