[rabbitmq-discuss] Crash with RabbitMQ 3.1.5

David Harrison dave.l.harrison at gmail.com
Wed Oct 16 16:34:09 BST 2013


Quick update on the queue count: 56

On 17 October 2013 02:29, David Harrison <dave.l.harrison at gmail.com> wrote:

> On 17 October 2013 01:29, Tim Watson <tim at rabbitmq.com> wrote:
>
>> Hello David!
>>
>>
> Hey Tim, thanks for replying so quickly!
>
>
>>  On 16 Oct 2013, at 15:14, David Harrison wrote:
>> > Hoping someone can help me out.  We recently experienced 2 crashes with
>> our RabbitMQ cluster.  After the first crash, we moved the Mnesia
>> directories elsewhere, and started RabbitMQ again.  This got us up and
>> running.  Second time it happened, we had the original nodes plus an
>> additional 5 nodes we had added to the cluster that we were planning to
>> leave in place while shutting the old nodes down.
>> >
>>
>> What version of rabbit are you running, and how was it installed?
>>
>
> 3.1.5, running on Ubuntu Precise, installed via deb package.
>
>
>>
>> > During the crash symptoms were as follows:
>> >
>> > - Escalating (and sudden) CPU utilisation on some (but not all) nodes
>>
>> We've fixed at least one bug with that symptom in recent releases.
>>
>
> I think 3.1.5 is the latest stable release?
>
>
>>
>> > - Increasing time to publish on queues (and specifically on a test
>> queue we have set up that exists only to test publishing and consuming from
>> the cluster hosts)
>>
>> Are there multiple publishers on the same connection/channel when this
>> happens? It wouldn't be unusual, if the server was struggling, to see flow
>> control kick in and affect publishers in this fashion.
>>
>
> Yes, in some cases there would be; for our test queue there wouldn't be --
> though we saw publish times of up to 10s even on the test queue.
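>
> For reference, roughly the kind of timed publish I mean (just a sketch using
> pika; the host and queue names are made up, not our actual test publisher):
>
> import time
> import pika
>
> # Connect to one of the cluster nodes (hypothetical hostname)
> conn = pika.BlockingConnection(pika.ConnectionParameters(host="b00.internal"))
> ch = conn.channel()
> ch.queue_declare(queue="cluster.test", durable=True)  # hypothetical test queue
> ch.confirm_delivery()  # wait for the broker ack, so flow control shows up as latency
>
> start = time.time()
> ch.basic_publish(exchange="", routing_key="cluster.test", body=b"ping")
> print("publish+confirm took %.3fs" % (time.time() - start))
> conn.close()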
>
>
>>
>> > - Running `rabbitmqctl cluster status` gets increasingly slow (some
>> nodes eventually taking up to 10m to return with the response data - some
>> were fast and took 5s)
>>
>> Wow, 10m is amazingly slow. Can you provide log files for this period of
>> activity and problems?
>>
>
> I'll take a look; we saw a few "too many processes" messages, and
>
> "Generic server net_kernel terminating" followed by:
>
> ** Reason for termination ==
> ** {system_limit,[{erlang,spawn_opt,
>                       [inet_tcp_dist,do_setup,
>                        [<0.19.0>,'rabbit@b02.internal',normal,
>                         'rabbit@b00.internal',longnames,7000],
>                        [link,{priority,max}]]},
>                   {net_kernel,setup,4},
>                   {net_kernel,handle_call,3},
>                   {gen_server,handle_msg,5},
>                   {proc_lib,init_p_do_apply,3}]}
>
>
> =ERROR REPORT==== 15-Oct-2013::16:07:10 ===
> ** gen_event handler rabbit_error_logger crashed.
> ** Was installed in error_logger
> ** Last event was: {error,<0.8.0>,{emulator,"~s~n",["Too many processes\n"]}}
> ** When handler state == {resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}
> ** Reason == {aborted,
>                  {no_exists,
>                      [rabbit_topic_trie_edge,
>                       {trie_edge,
>                           {resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>},
>                           root,"error"}]}}
>
>
> =ERROR REPORT==== 15-Oct-2013::16:07:10 ===
> Mnesia(nonode@nohost): ** ERROR ** mnesia_controller got unexpected info:
>     {'EXIT',<0.97.0>,shutdown}
>
> =ERROR REPORT==== 15-Oct-2013::16:11:38 ===
> Mnesia('rabbit@b00.internal'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@b01.internal'}
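>
> So it looks like we blew through the Erlang process limit as well. We're
> going to keep an eye on proc_used vs proc_total (and the partitions field)
> per node via the management API -- rough sketch, assuming the management
> plugin is enabled; the URL and credentials are just placeholders:
>
> import base64, json, urllib.request
>
> def get_nodes(host="b00.internal", user="guest", password="guest"):
>     req = urllib.request.Request("http://%s:15672/api/nodes" % host)
>     token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
>     req.add_header("Authorization", "Basic " + token)
>     with urllib.request.urlopen(req) as resp:
>         return json.load(resp)
>
> for node in get_nodes():
>     print("%s: %s of %s Erlang processes, partitions=%s"
>           % (node["name"], node.get("proc_used"), node.get("proc_total"),
>              node.get("partitions")))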
>
>
>
>
>> > - When trying to shut down a node, running `rabbitmqctl stop_app`
>> appears to block on epmd and doesn't return
>>
>> Again, we've fixed bugs in that area in recent releases.
>>
>> > --- When that doesn't return we eventually have to ctrl-c the command
>> > --- We have to issue a kill signal to rabbit to stop it
>> > --- Do the same to the epmd process
>>
>> Even if you have to `kill -9' a rabbit node, you shouldn't need to kill
>> epmd. In theory at least. If that was necessary to fix the "state of the
>> world", it would be indicative of a problem related to the erlang
>> distribution mechanism, but I very much doubt that's the case here.
>>
>> > Config / details as follows (we use mirrored queues -- 5 hosts, all
>> disc nodes, with a global policy that all queues are mirrored
>> "ha-mode:all"), running on Linux
>> >
>>
>> How many queues are we talking about here?
>>
>
> ~30
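>
> (For what it's worth, a quick way to see where each of those queues is
> mirrored is the management API's /api/queues -- again just a sketch, plugin
> and guest credentials assumed:)
>
> import base64, json, urllib.request
>
> def api_get(path, host="b00.internal", user="guest", password="guest"):
>     req = urllib.request.Request("http://%s:15672%s" % (host, path))
>     token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
>     req.add_header("Authorization", "Basic " + token)
>     with urllib.request.urlopen(req) as resp:
>         return json.load(resp)
>
> for q in api_get("/api/queues"):
>     print("%s (master on %s), mirrors: %s"
>           % (q["name"], q["node"], q.get("slave_nodes", [])))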
>
>
>>
>> > [
>> >         {rabbit, [
>> >                 {cluster_nodes, {['rabbit@b05.internal',
>> >                                   'rabbit@b06.internal',
>> >                                   'rabbit@b07.internal',
>> >                                   'rabbit@b08.internal',
>> >                                   'rabbit@b09.internal'], disc}},
>> >                 {cluster_partition_handling, pause_minority}
>>
>> Are you sure that what you're seeing is not caused by a network
>> partition? If it were, any nodes in a minority island would "pause", which
>> would certainly lead to the kind of symptoms you've mentioned here, viz
>> rabbitmqctl calls not returning and so on.
>>
>
> There was definitely a network partition, but the whole cluster nosedived
> during the crash.
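>
> Makes sense. If I've understood pause_minority right, each node keeps running
> only while it can still see a strict majority of the cluster (counting
> itself), otherwise it pauses -- roughly:
>
> # Rough illustration of the pause_minority rule, not RabbitMQ's actual code
> def keeps_running(visible_nodes, all_nodes):
>     # visible_nodes includes the node itself
>     return len(visible_nodes) * 2 > len(all_nodes)
>
> cluster = ["b05", "b06", "b07", "b08", "b09"]
> print(keeps_running(["b05", "b06", "b07"], cluster))  # True: majority side stays up
> print(keeps_running(["b08", "b09"], cluster))         # False: minority side pauses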
>
>
>>
>> > The erl_crash.dump slogan error was: eheap_alloc: Cannot allocate
>> > 2850821240 bytes of memory (of type "old_heap").
>> >
>>
>> That's a plain old OOM failure. Rabbit ought to start deliberately paging
>> messages to disk well before that happens, which might also explain a lot
>> of the slowness/unresponsiveness.
>>
>
> These hosts aren't running swap; we give them a fair bit of RAM (and have
> given them even more now as a possible stopgap).
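>
> One thing we'll double-check is the memory watermark config. If I have the
> defaults right (vm_memory_high_watermark 0.4, paging ratio 0.5), paging
> should start at roughly 20% of RAM -- back-of-envelope, with a made-up RAM
> size:
>
> ram_bytes = 16 * 1024**3     # example host with 16 GB of RAM (not our exact spec)
> watermark = 0.4 * ram_bytes  # memory alarm / publisher-blocking threshold
> paging_at = 0.5 * watermark  # rabbit should start paging messages to disk here
> failed_alloc = 2850821240    # the old_heap allocation from the crash slogan
>
> print("watermark %.1f GB, paging from %.1f GB, failed alloc %.1f GB"
>       % (watermark / 1024**3, paging_at / 1024**3, failed_alloc / 1024**3))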
>
>
>>
>> > System version : Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2]
>> [rq:2] [async-threads:0] [kernel-poll:false]
>> >
>>
>> I'd strongly suggest upgrading to R16B02 if you can. R14 is pretty
>> ancient and a *lot* of bug fixes have appeared in erts + OTP since then.
>>
>>
> ok good advice, we'll do that
>
>
>>  > When I look at the Process Information it seems there's a small number
>> with a lot of messages queued, and the rest are an order of magnitude lower:
>> >
>>
>> That's not unusual.
>>
>> > when I view the second process (the first one crashes Erlang on me), I see
>> a large number of sender_death events (not sure if these are common or
>> highly unusual?)
>> >
>> > {'$gen_cast',{gm,{sender_death,<2710.20649.64>}}}
>> >
>>
>> Interesting - will take a look at that. If you could provide logs for the
>> participating nodes during this whole time period, that would help a lot.
>>
>> > mixed in with other more regular events:
>> >
>>
>> Actually, sender_death messages are not "irregular" as such. They're just
>> notifying the GM group members that another member (on another node) has
>> died. This is quite normal with mirrored queues, when nodes get partitioned
>> or stopped due to cluster recovery modes.
>>
>> Cheers,
>> Tim
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>
>

