[rabbitmq-discuss] Someone else with a nodedown error
Tim Watson
tim at rabbitmq.com
Thu May 16 21:54:13 BST 2013
Hi Eric,
> Thanks for the insight, Tim. I ended up killing the process before receiving your email, so I was unable to take advantage of your pro tips. We subsequently upgraded our cluster to 3.1.0 on Erlang R15B03, and have run into a slightly different problem. This appears to be unrelated, but I did use your tips to gather as much information as possible.
>
> Scenario:
> A producer is sending messages to rabbit faster than they can be routed, so the connection goes under flow control. The flow control lasts long enough that our load balancer severs the connection. Before we understood that the LB was our problem, we would just retry the socket write up to 10 times before throwing an exception. In the end, our cluster now has 0 connections yet still has nearly 4000 channels (screen grab attached). Also, I am now unable to stop_app on any node in the cluster without it hanging - at this point for over an hour.
That looks and sounds like a seriously unhappy Rabbit - I'm very concerned, especially about those open channels, but it's quite late, so I'll need to get into this properly in the morning. For the time being though, it looks a lot like something has gone horribly wrong with your Erlang installation. What happens if you start up Erlang by itself, using `erl -sname test` - do you still see all those screwy warnings? It seems as though the emulator's code server is unable to read from the file system - this could be due to a change in permissions somewhere on the host's file system, *or* it could be that you're running the `erl -sname debug -remsh rabbit at rabbit-box` command from a user account with insufficient permissions.
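For example, something along these lines, run as the rabbitmq user, should tell us whether the code server can read its own libraries at all (the module picked is arbitrary, and the path in the comment is just what a typical install looks like - yours may differ):

$ sudo -u rabbitmq -H erl -sname test
1> code:which(beam_lib).   % should print a path like ".../stdlib-1.18.3/ebin/beam_lib.beam"
2> q().

If code:which/1 returns a path rather than an error, the Erlang installation itself is readable and the remaining suspect is the account (or working directory) you ran the remsh from.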
Can you run `lsof` and let us know what the host's use of file descriptors looks like, please? A Rabbit should never have open channels that aren't tied to a connection. It looks to me like your rabbit is stuck either starting up or shutting down - that's why the application controller isn't responding to calls (or is timing out). In fact, general stats about the state of your OS would be really useful - file descriptor usage, current disk I/O, CPU and memory use would all help. If you've got historic data about these too, especially around the time of your traffic spike, all the better.
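To save any guesswork, the kind of thing I have in mind is just the standard Linux tooling (substitute the actual pid of the beam.smp process, and adjust if your distro differs):

$ lsof -u rabbitmq | wc -l     # total open files/sockets held by the rabbitmq user
$ lsof -p <beam_pid> | wc -l   # descriptors held by the Erlang VM itself
$ vmstat 5 5                   # CPU, memory and swap activity over ~25 seconds
$ iostat -x 5 5                # per-device disk I/O
$ free -m                      # memory headroom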
It would be *very* helpful if you'd be able to grant us ssh access to this machine - we might be able to sign some legal stuff if that helps. Without access to the machine, the diagnostic process is going to involve a fair bit of to-ing and fro-ing. We could make use of IM/IRC if that helps too. Drop me a note and let me know what's likely to work best for you.
To clarify your specific questions:
> Questions:
> 1. Should rabbit maintain channels without active connections? If not, has anyone run into this before?
No, never. I've not seen this one before, but I'll do some archeology in our bug tracker and the mailing list to see if there's anything that might help.
> 2. Is there any way to force kill these channels?
We might be able to find a way to do that, but TBH, given the state your rabbit is in, I think it'll need to be killed eventually. BUT PLEASE don't do that yet. It's really important that we understand how your system got into this state, because it represents a really nasty bug that we should fix asap, and having a running Rabbit in this state represents our best chance of understanding what's going on.
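What we *can* safely do in the meantime is record what those channels look like without touching them; something along these lines (those column names should all be valid channel info items in 3.1.0):

rabbitmqctl list_channels pid connection number user vhost messages_unacknowledged

If even that hangs, that's useful evidence in itself.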
> 3. Is it a good practice for consumers to connect to rabbit via a load balancer, or should that be for producers only? Are there any recommended timeout settings?
You should be able to use a LB for either kind of client, and the choice of timeout should be application- and infrastructure-dependent - i.e., whatever makes sense for your use case. Doing so should *not* get you into this state, ever. We *will* find out what's gone on here and fix it asap. The *only* caveat is that, if something catastrophic has gone wrong with your host machine (e.g., the file system has bombed out, or something like that), then it might be that no application would stand any chance of surviving intact - but we need to understand what's gone on here either way.
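One general point about load balancers, offered only as a sketch - the record fields below are from the Erlang amqp_client, so translate to whatever your client library exposes, and the hostname and value are illustrative: negotiating an AMQP heartbeat comfortably shorter than the LB's idle timeout means a flow-blocked but otherwise healthy connection won't be silently severed, and both ends find out promptly if it is.

%% in a module that includes amqp_client.hrl
{ok, Conn} = amqp_connection:start(#amqp_params_network{
                 host      = "rabbit-box",   %% illustrative hostname
                 heartbeat = 30              %% seconds; keep well under the LB idle timeout
             }).

None of which excuses the orphaned channels, of course - that's a bug at our end regardless of how the connections die.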
Cheers,
Tim
On 16 May 2013, at 20:09, Eric Berg wrote:
>
> Here are the results of some commands on the shell to the broker:
>
>
> bash-4.1$ erl -sname debug -remsh rabbit at rabbit-box
> {error_logger,{{2013,5,16},{14,44,36}},std_error,"File operation error: eacces. Target: .. Function: read_file_info. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./beam_lib.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./ram_file.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./standard_error.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./supervisor_bridge.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./user_sup.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./user_drv.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./group.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./edlin.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./io_lib.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./proplists.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./io_lib_format.beam. Function: get_file. Process: code_server."}
> Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]
>
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./net.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./inet_gethost_native.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error: eacces. Target: ./kernel_config.beam. Function: get_file. Process: code_server."}
>
> =ERROR REPORT==== 16-May-2013::10:44:36 ===
> File operation error: eacces. Target: .. Function: read_file_info. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./beam_lib.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./ram_file.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./standard_error.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./supervisor_bridge.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./user_sup.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./user_drv.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./group.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./edlin.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> {lost_messages,6}
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./error_logger_tty_h.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./calendar.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./io_lib_pretty.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./io.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_scan.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./c.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_eval.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./orddict.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./file_io_server.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_posix_msg.beam. Function: get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> file:path_eval([".","/var/lib/rabbitmq"],".erlang"): permission denied
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./dist_util.beam. Function: get_file. Process: code_server.
> Eshell V5.9.3.1 (abort with ^G)
>
>
>
>
> -------------------------------------------------
>
> (rabbit at rabbit-box)1> whereis(rabbit).
> <0.251.0>
>
>
> -------------------------------------------------
>
> (rabbit at rabbit-box)2> application:loaded_applications().
> [{public_key,"Public key infrastructure","0.17"},
> {rabbitmq_management_agent,"RabbitMQ Management Agent",
> "3.1.0"},
> {os_mon,"CPO CXC 138 46","2.2.10"},
> {amqp_client,"RabbitMQ AMQP Client","3.1.0"},
> {mnesia,"MNESIA CXC 138 12","4.7.1"},
> {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.1.0"},
> {inets,"INETS CXC 138 49","5.9.2"},
> {rabbitmq_federation,"RabbitMQ Federation","3.1.0"},
> {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.0"},
> {rabbitmq_federation_management,"RabbitMQ Federation Management",
> "3.1.0"},
> {rabbitmq_management_visualiser,"RabbitMQ Visualiser",
> "3.1.0"},
> {kernel,"ERTS CXC 138 10","2.15.3"},
> {crypto,"CRYPTO version 2","2.2"},
> {ssl,"Erlang/OTP SSL application","5.1.2"},
> {sasl,"SASL CXC 138 11","2.2.1"},
> {rabbitmq_management,"RabbitMQ Management Console","3.1.0"},
> {mochiweb,"MochiMedia Web Server",
> "2.3.1-rmq3.1.0-gitd541e9a"},
> {xmerl,"XML parser","1.3.2"},
> {rabbit,"RabbitMQ","3.1.0"},
> {stdlib,"ERTS CXC 138 10","1.18.3"},
> {webmachine,"webmachine","1.9.1-rmq3.1.0-git52e62bc"}]
>
>
> ------------------------------------------------------------------------------------
>
>
> (rabbit at rabbit-box)3> application:which_applications().
> ** exception exit: {timeout,{gen_server,call,
> [application_controller,which_applications]}}
> in function gen_server:call/2 (gen_server.erl, line 180)
>
>
> -----------------------------------------------------------------------------------
>
> (rabbit at rabbit-box)4> rabbit:stop().
>
>
> -- This hangs indefinitely
>
>
> ------------------------------------------------------------------------------------
>
> Perms on /var/lib/rabbitmq
> drwxr-xr-x 3 rabbitmq rabbitmq 4096 May 1 07:15 rabbitmq
>
>
>
>
> Thanks Tim et al.
>
>
>
>
> On Thu, May 9, 2013 at 10:00 AM, Tim Watson <tim at rabbitmq.com> wrote:
> Hi,
>
> On 8 May 2013, at 18:22, Eric Berg wrote:
>
>> I have read through many of these nodedown error emails and of course none of them seem to be exactly what I am experiencing.
>>
>> I have a 4 node cluster, and one of the nodes went offline according to the cluster. This box has the following in the sasl log:
>>
>> =SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===
>> Supervisor: {<0.11197.1096>,
>> rabbit_channel_sup_sup}
>> Context: shutdown_error
>> Reason: noproc
>> Offender: [{pid,<0.11199.1096>},
>> {name,channel_sup},
>> {mfa,{rabbit_channel_sup,start_link,[]}},
>> {restart_type,temporary},
>> {shutdown,infinity},
>> {child_type,supervisor}]
>>
>
> This simply indicates that an error occurred whilst a supervised process was shutting down. It's not indicative of the whole node going down - Erlang allows processes to crash and be restarted whilst the system is running.
>
>> Yet in the regular rabbit log i can see that it was still accepting connections up until 2:22AM the next day:
>>
>> (last log entry)
>> =INFO REPORT==== 8-May-2013::02:22:26 ===
>> closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)
>>
>
> So clearly that node didn't actually go offline. The 'nodedown' message in the other clustered brokers' logs does not necessarily mean that the node in question crashed; this could, for example, be indicative of a net-split or other connectivity failure.
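> A quick way to check for that, for future reference (if I remember rightly, the partitions section in cluster_status only appears from 3.1.0 onwards, so on older releases look for net-tick warnings in the logs instead):
>
> $ rabbitmqctl cluster_status
>
> and, from a remote shell on each node, evaluate `nodes().' - that should list every other cluster member if connectivity is healthy.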
>
>> Running rabbitmqctl status returns:
>>
>> [root at rabbit-box rabbitmq]# rabbitmqctl status
>> Status of node 'rabbit at rabbit-box' ...
>> Error: unable to connect to node 'rabbit at rabbit-box': nodedown
>>
>> DIAGNOSTICS
>> ===========
>>
>> nodes in question: ['rabbit at rabbit-box']
>>
>> hosts, their running nodes and ports:
>> - rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]
>>
>> current node details:
>> - node name: 'rabbitmqctl2301 at rabbit-box'
>> - home dir: /var/lib/rabbitmq
>> - cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==
>
> Have you tried running this using `sudo' instead of as root? Is the rabbitmq user's account and home folder in a consistent state? The security cookie used for inter-node communications, which includes communication between the temporary `rabbitmqctl' node and the broker, has to be the same for all the peers.
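> For example (assuming the usual default paths - adjust if your packaging differs):
>
> $ ls -l /var/lib/rabbitmq/.erlang.cookie
> $ ls -l ~/.erlang.cookie
> $ sudo -u rabbitmq md5sum /var/lib/rabbitmq/.erlang.cookie
>
> The cookie file needs to be readable by whichever account runs `rabbitmqctl', and its contents identical on every node.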
>
>> A couple of notes:
>> - Looking for a process run by rabbit show that it appears to still be running
>
> Yes - as I said, from what you've described there's no indication that this node actually died. However, `rabbitmqctl` should at the very least be able to connect to rabbit at rabbit-box.
>
>> - Erlang cookie is the same on all nodes of the cluster, the cookie hash is the same as well
>
> If it's not the cookies then....
>
>> - A traffic spike occurred right around the time of the last entry in the rabbit log
>
> It sounds like this could be a potential culprit. Can you provide any more information about what happened? It could be that whilst the network was saturated, the node in question got disconnected from the other nodes in the cluster because it exceeded the "net tick time", and things subsequently started to go wrong. That shouldn't happen - i.e., the node should be able to re-establish connectivity - but it's possible that something's gone wrong here.
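> For reference, you can read the tick time from a shell on the broker (it's set via the kernel application's net_ticktime parameter); something like:
>
> (rabbit at rabbit-box)1> net_kernel:get_net_ticktime().
> 60
>
> 60 seconds is the Erlang default, so a saturation event lasting on the order of a minute is enough for nodes to declare each other down.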
>
> What that doesn't explain is why you can't connect from rabbitmqctl. If you `su rabbitmq', can you then run `erl -sname debug -remsh rabbit at rabbit-box' to establish a shell into the running broker? If that does work, then you can stop the rabbit application and then the node, as follows:
>
> > rabbit:stop().
> ok
> > init:stop().
>
> But before you do, it might be worth evaluating a couple of other things that might help us identify what's going on:
>
> (rabbit at iske)1> whereis(rabbit).
> <0.152.0>
> (rabbit at iske)2> application:loaded_applications().
> [{os_mon,"CPO CXC 138 46","2.2.9"},
> {rabbitmq_management_agent,"RabbitMQ Management Agent",
> "0.0.0"},
> {amqp_client,"RabbitMQ AMQP Client","0.0.0"},
> etc ...
> ]
> (rabbit at iske)3> application:which_applications().
> [{rabbitmq_shovel_management,"Shovel Status","0.0.0"},
> etc ...
> ]
>
> If during any of these you get stuck, CTRL-C (and press the key for 'abort') should get you back out again without incident.
>
>
>> - I can find no other errors in any logs that relate to rabbit or erlang
>> - Up until this point the cluster has been running fine for over 40 days.
>> - telnet IP_ADDRESS 5672 times out
>
> So the broker is no longer accepting new AMQP connections then. Something's clearly quite wrong with this node.
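> It would be worth checking whether the listener socket is even still there (the netstat flags below are the common Linux ones; `ss' works equally well):
>
> $ netstat -tlnp | grep 5672
>
> If the socket is still in LISTEN but connection attempts hang, that points at the broker being wedged internally rather than the listener having gone away.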
>
>> - I have not restarted the box, erlang node, or entire rabbitmq-server
>>
>> Is there anywhere else I can go looking for errors? I am about to start killing processes, but I'm not sure that will solve anything.
>>
>
> Did you do that in the end? If not, I would really like to get to the bottom of what's wrong with this node. I don't suppose it would be possible for you to give us access to this machine, would it? If necessary, we may be able to get some kind of confidentiality agreement signed if that'd help.
>
> Cheers,
>
> Tim Watson
> Staff Engineer
> RabbitMQ
>
>
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>
> <rabbit-manychannels.png>