[rabbitmq-discuss] Someone else with a nodedown error

Eric Berg eric.berg at pardot.com
Fri May 17 13:45:57 BST 2013


Thanks for your response, Tim. If you would like SSH access to these boxes,
let me know; we can work something out privately. Thanks!

Update from yesterday:
It looks like 2 of the 4 nodes in our cluster have finally shut down; all
channels are now gone. Another node in the cluster hangs on
> sudo rabbitmqctl status

and the final node in the cluster appears to be running just fine. It does,
however, see the unresponsive node as running in the cluster status, as
does the web UI.
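
As an aside, when poking at the hung node it helps to wrap the call so the
terminal isn't blocked indefinitely (GNU coreutils timeout; illustrative):

timeout 30 sudo rabbitmqctl status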


*When you upgraded your cluster, what RabbitMQ version did you upgrade from
and to, and did you upgrade Erlang as well and if so, which versions were
involved?*
- We upgraded from 3.0.4 to 3.1.0. We did not upgrade Erlang; it was, and
still is, at version R15B03. We did, however, install the RabbitMQ RPM with
the --nodeps flag because it did not detect the Erlang dependency
correctly. We had previously installed Erlang:

esl-erlang.x86_64    R15B03-2           @erlang-solutions
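
The RabbitMQ install itself was along these lines (the exact package file
name here is illustrative):

# forced past the dependency check, since rpm couldn't see esl-erlang
sudo rpm -Uvh --nodeps rabbitmq-server-3.1.0-1.noarch.rpm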

*What happens if you start up Erlang by itself, using `erl -sname test` - do
you still see all those screwy warnings?*
All 4 of the nodes can run this without issue as my user, but when I sudo
su to the rabbitmq user I get errors like the following on 2 of the 4 nodes:

{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: .. Function: read_file_info. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./beam_lib.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./ram_file.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./standard_error.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./supervisor_bridge.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./user_sup.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./user_drv.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./group.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./edlin.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./io_lib.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./proplists.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./io_lib_format.beam. Function: get_file. Process:
code_server."}
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16]
[async-threads:0] [hipe] [kernel-poll:false]

{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./shell.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./user_default.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,17},{8,32,39}},std_error,"File operation error:
eacces. Target: ./kernel_config.beam. Function: get_file. Process:
code_server."}

=ERROR REPORT==== 17-May-2013::04:32:39 ===
File operation error: eacces. Target: .. Function: read_file_info. Process:
code_server.


...............
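
One guess worth noting: every failing target is relative ('.' or
'./*.beam'), so this may just be the working directory we su from being
unreadable to the rabbitmq user, rather than anything wrong with the OTP
install itself. That's easy to check (the second path assumes the default
esl-erlang layout):

# from the same directory erl was started in
sudo -u rabbitmq ls -ld .
# confirm the OTP lib directory itself is readable
sudo -u rabbitmq ls /usr/lib/erlang/lib | head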







On Fri, May 17, 2013 at 4:34 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Some more questions as well...
>
> When you upgraded your cluster, what RabbitMQ version did you upgrade from
> and to, and did you upgrade Erlang as well and if so, which versions were
> involved?
>
> Cheers,
> Tim
>
> On 16 May 2013, at 21:54, Tim Watson wrote:
>
> Hi Eric,
>
> Thanks for the insight, Tim. I ended up killing the process before
> receiving your email, so I was unable to take advantage of your pro tips. We
> subsequently upgraded our cluster to 3.1.0 on Erlang R15B03, and have run
> into a slightly different problem. This appears to be unrelated, but I did
> use your tips to gather as much information as possible.
>
> Scenario:
> A producer is sending messages to rabbit faster than they can be routed so
> the connection goes under flow control. The flow control lasts long enough
> so that our load balancer severs the connection. Before we understood that
> the LB was our problem we would just retry socket write up to 10 times
> before throwing an exception. In the end, our cluster now has 0 connections
> yet it still has nearly 4000 channels (screen grab attached). Also, now I
> am unable to stop_app on any node in the cluster without it hanging for, at
> this point, over an hour.
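>
> For what it's worth, the retry was just this shape (an Erlang sketch for
> illustration only; try_write/1 is a hypothetical stand-in for our actual
> socket write):
>
> publish_with_retry(_Msg, 0) ->
>     throw(max_retries_exceeded);
> publish_with_retry(Msg, Retries) ->
>     case try_write(Msg) of
>         ok               -> ok;
>         {error, _Closed} -> publish_with_retry(Msg, Retries - 1)
>     end.
>
> %% invoked as publish_with_retry(Msg, 10)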
>
>
> That looks and sounds like a seriously unhappy Rabbit - I'm very
> concerned, especially about those open channels, but it's quite late, so
> I'll need to get into this properly in the morning. For the time being
> though, it looks a lot like something has gone horribly wrong with your
> Erlang installation. What happens if you start up Erlang by itself, using
> `erl -sname test` - do you still see all those screwy warnings? It seems as
> though the emulator's code server is unable to read from the file system -
> this could be due to a change in permissions somewhere on the host's file
> system, *or* it could be that you're running the `erl -sname debug -remsh
> rabbit at rabbit-box` command from a user account with insufficient
> permissions.
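>
> For example, running the remote shell as the rabbitmq user with its own
> home directory (so the right ~/.erlang.cookie is picked up) would look
> something like:
>
> sudo -u rabbitmq -H erl -sname debug -remsh rabbit at rabbit-box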
>
> Can you run `lsof` and let us know what the host's use of file descriptors
> looks like, please. A Rabbit should never have open channels that aren't
> tied to a connection. It looks to me like your rabbit is stuck either
> starting up or shutting down - that's why the application controller isn't
> responding to calls (or is timing out). In fact, general stats about the
> state of your OS would be really useful - file descriptor usage, current
> disk io, cpu and memory use would all help. If you've got historic data
> about these too, especially around the time of your traffic spike, all the
> better.
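>
> Something along these lines would cover it (illustrative - any equivalent
> tooling is fine):
>
> lsof -u rabbitmq | wc -l   # rough file descriptor usage for the rabbitmq user
> vmstat 5 3                 # cpu and memory pressure
> iostat -x 5 3              # disk io
> free -m                    # memory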
>
> It would be *very* helpful if you'd be able to grant us ssh access to this
> machine - we might be able to sign some legal stuff if that helps. Without
> access to the machine, the diagnostic process is going to involve a fair
> bit of to-ing and fro-ing. We could make use of IM/IRC if that helps too.
> Drop me a note and let me know what's likely to work best for you.
>
> To clarify:
>
> Questions:
> 1. Should rabbit maintain channels without active connections? If not, has
> anyone run into this before?
>
>
> No, never. I've not seen this one before, but I'll do some archeology in
> our bug tracker and the mailing list to see if there's anything that might
> help.
>
> 2. Is there any way to force kill these channels?
>
>
> We might be able to find a way to do that, but TBH, given the state your
> rabbit is in, I think it'll need to be killed. BUT PLEASE don't do that.
> It's really important that we understand how your system got into this
> state, because it represents a really nasty bug that we should fix asap,
> and having a running Rabbit in this state represents our best chance at
> understanding what's going on.
>
> 3. Is it a good practice for consumers to connect to rabbit via a load
> balancer, or should that be for producers only? Are there any recommended
> timeout settings?
>
>
> You should be able to use a LB for either kind of client, the choice of
> timeout should be application and infrastructure dependent - i.e., whatever
> makes sense for your use case - and doing so should *not* get you into this
> state, ever. We *will* find out what's gone on here and fix it, asap. The
> *only* caveat here is that, if something catastrophic has gone wrong with
> your host machine (e.g., the file system has bombed out, or something like
> that) then it might be that no application would stand any chance of
> surviving intact, but we need to understand what's gone on here either way.
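>
> Purely as an illustration - assuming an haproxy-style TCP load balancer,
> which may well not be what you're running - generous idle timeouts for
> AMQP traffic would look like:
>
> listen rabbitmq
>     bind *:5672
>     mode tcp
>     timeout client 3h
>     timeout server 3h
>     server rabbit1 10.0.0.1:5672 check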
>
> Cheers,
> Tim
>
> On 16 May 2013, at 20:09, Eric Berg wrote:
>
>
> Here are the results of some commands on the shell to the broker:
>
>
> bash-4.1$ erl -sname debug -remsh rabbit at rabbit-box
> {error_logger,{{2013,5,16},{14,44,36}},std_error,"File operation error:
> eacces. Target: .. Function: read_file_info. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./beam_lib.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./ram_file.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./standard_error.beam. Function: get_file. Process:
> code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./supervisor_bridge.beam. Function: get_file. Process:
> code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./user_sup.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./user_drv.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./group.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./edlin.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./io_lib.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./proplists.beam. Function: get_file. Process:
> code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./io_lib_format.beam. Function: get_file. Process:
> code_server."}
> Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16]
> [async-threads:0] [hipe] [kernel-poll:false]
>
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./net.beam. Function: get_file. Process: code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./inet_gethost_native.beam. Function: get_file. Process:
> code_server."}
> {error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
> eacces. Target: ./kernel_config.beam. Function: get_file. Process:
> code_server."}
>
> =ERROR REPORT==== 16-May-2013::10:44:36 ===
> File operation error: eacces. Target: .. Function: read_file_info.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./beam_lib.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./ram_file.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./standard_error.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./supervisor_bridge.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./user_sup.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./user_drv.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./group.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> File operation error: eacces. Target: ./edlin.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::10:44:37 ===
> {lost_messages,6}
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./error_logger_tty_h.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./calendar.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./io_lib_pretty.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./io.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_scan.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./c.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_eval.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./orddict.beam. Function: get_file.
> Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./file_io_server.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./erl_posix_msg.beam. Function:
> get_file. Process: code_server.
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> file:path_eval([".","/var/lib/rabbitmq"],".erlang"): permission denied
>
> =ERROR REPORT==== 16-May-2013::14:44:37 ===
> File operation error: eacces. Target: ./dist_util.beam. Function:
> get_file. Process: code_server.
> Eshell V5.9.3.1  (abort with ^G)
>
>
>
>
> -------------------------------------------------
>
> (rabbit at rabbit-box)1> whereis(rabbit).
> <0.251.0>
>
>
> -------------------------------------------------
>
> (rabbit at rabbit-box)2> application:loaded_applications().
> [{public_key,"Public key infrastructure","0.17"},
>  {rabbitmq_management_agent,"RabbitMQ Management Agent",
>                             "3.1.0"},
>  {os_mon,"CPO  CXC 138 46","2.2.10"},
>  {amqp_client,"RabbitMQ AMQP Client","3.1.0"},
>  {mnesia,"MNESIA  CXC 138 12","4.7.1"},
>  {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.1.0"},
>  {inets,"INETS  CXC 138 49","5.9.2"},
>  {rabbitmq_federation,"RabbitMQ Federation","3.1.0"},
>  {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.0"},
>  {rabbitmq_federation_management,"RabbitMQ Federation Management",
>                                  "3.1.0"},
>  {rabbitmq_management_visualiser,"RabbitMQ Visualiser",
>                                  "3.1.0"},
>  {kernel,"ERTS  CXC 138 10","2.15.3"},
>  {crypto,"CRYPTO version 2","2.2"},
>  {ssl,"Erlang/OTP SSL application","5.1.2"},
>  {sasl,"SASL  CXC 138 11","2.2.1"},
>  {rabbitmq_management,"RabbitMQ Management Console","3.1.0"},
>  {mochiweb,"MochiMedia Web Server",
>            "2.3.1-rmq3.1.0-gitd541e9a"},
>  {xmerl,"XML parser","1.3.2"},
>  {rabbit,"RabbitMQ","3.1.0"},
>  {stdlib,"ERTS  CXC 138 10","1.18.3"},
>  {webmachine,"webmachine","1.9.1-rmq3.1.0-git52e62bc"}]
>
>
>
> ------------------------------------------------------------------------------------
>
>
> (rabbit at rabbit-box)3> application:which_applications().
> ** exception exit: {timeout,{gen_server,call,
>
> [application_controller,which_applications]}}
>      in function  gen_server:call/2 (gen_server.erl, line 180)
>
>
>
> -----------------------------------------------------------------------------------
>
> (rabbit at rabbit-box)4> rabbit:stop().
>
>
> -- This hangs indefinitely
>
>
>
> ------------------------------------------------------------------------------------
>
> Perms on /var/lib/rabbitmq
> drwxr-xr-x  3 rabbitmq rabbitmq 4096 May  1 07:15 rabbitmq
>
>
>
>
> Thanks Tim et al.
>
>
>
>
> On Thu, May 9, 2013 at 10:00 AM, Tim Watson <tim at rabbitmq.com> wrote:
>
>> Hi,
>>
>> On 8 May 2013, at 18:22, Eric Berg wrote:
>>
>> I have read through many of these nodedown error emails and of course
>> none of them seem to be exactly what I am experiencing.
>>
>> I have a 4 node cluster, and one of the nodes went offline according to
>> the cluster. This box has the following in the sasl log:
>>
>> =SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===
>>      Supervisor: {<0.11197.1096>, rabbit_channel_sup_sup}
>>      Context:    shutdown_error
>>      Reason:     noproc
>>      Offender:   [{pid,<0.11199.1096>},
>>                   {name,channel_sup},
>>                   {mfa,{rabbit_channel_sup,start_link,[]}},
>>                   {restart_type,temporary},
>>                   {shutdown,infinity},
>>                   {child_type,supervisor}]
>>
>>
>> This simply indicates that an error occurred whilst a supervised process
>> was shutting down. It's not indicative of the whole node going down -
>> Erlang allows processes to crash and be restarted whilst the system is
>> running.
>>
>> *Yet in the regular rabbit log I can see that it was still accepting
>> connections up until 2:22AM the next day:*
>>
>> (last log entry)
>> =INFO REPORT==== 8-May-2013::02:22:26 ===
>> closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)
>>
>>
>> So clearly that node didn't actually go offline. The 'nodedown' message
>> in the other clustered brokers' logs does not necessarily mean that the
>> node in question crashed; this could, for example, be indicative of a
>> net-split or other connectivity failure.
>>
>> *Running rabbitmqctl status returns:*
>>
>> [root at rabbit-box rabbitmq]# rabbitmqctl status
>> Status of node 'rabbit at rabbit-box' ...
>> Error: unable to connect to node 'rabbit at rabbit-box': nodedown
>>
>> DIAGNOSTICS
>> ===========
>>
>> nodes in question: ['rabbit at rabbit-box']
>>
>> hosts, their running nodes and ports:
>> - rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]
>>
>> current node details:
>> - node name: 'rabbitmqctl2301 at rabbit-box'
>> - home dir: /var/lib/rabbitmq
>> - cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==
>>
>>
>> Have you tried running this using `sudo' instead of as root? Are the
>> rabbitmq user's account and home folder in a consistent state? The security
>> cookie used for inter-node communications, which includes communication
>> between the temporary `rabbitmqctl' node and the broker, has to be the same
>> for all the peers.
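>>
>> A quick way to verify that (paths assume the usual RPM layout, where the
>> rabbitmq user's home is /var/lib/rabbitmq):
>>
>> sudo cat /var/lib/rabbitmq/.erlang.cookie; echo
>> cat ~/.erlang.cookie; echo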
>>
>> A couple of notes:
>> - Looking for a process run by rabbit shows that it appears to still be
>> running
>>
>>
>> Yes - as I said, there's no indication that this node actually died from
>> what you've said. However `rabbitmqctl` should be able to connect to
>> rabbit at rabbit-box at the very least.
>>
>> - Erlang cookie is the same on all nodes of the cluster, the cookie hash
>> is the same as well
>>
>>
>> If it's not the cookies then....
>>
>> - A traffic spike occurred right around the time of the last entry in the
>> rabbit log
>>
>>
>> It sounds like this could be a potential culprit. Can you provide any
>> more information about what happened? It could be that whilst the network
>> was saturated, the node in question got disconnected from the other nodes
>> in the cluster because it exceeded the "net tick time" and subsequently
>> things have started to go wrong. That shouldn't happen - viz. the node should
>> be able to re-establish connectivity, but it's possible that something's
>> gone wrong here.
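>>
>> For reference, the current tick time can be read from any node's shell;
>> the default is 60 seconds:
>>
>> (rabbit at rabbit-box)1> net_kernel:get_net_ticktime().
>> 60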
>>
>> What that doesn't explain is why you can't connect from rabbitmqctl. If
>> you `su rabbitmq', can you then run `erl -sname debug -remsh
>> rabbit at rabbit-box' to establish a shell into the running broker? If that
>> does work, then you can stop the rabbit application and then the node, as
>> follows:
>>
>> > rabbit:stop().
>> ok
>> > init:stop().
>>
>> But before you do, it might be worth evaluating a couple of other things
>> that might help us identify what's going on:
>>
>> (rabbit at iske)1> whereis(rabbit).
>> <0.152.0>
>> (rabbit at iske)2> application:loaded_applications().
>> [{os_mon,"CPO  CXC 138 46","2.2.9"},
>>  {rabbitmq_management_agent,"RabbitMQ Management Agent",
>>                             "0.0.0"},
>>  {amqp_client,"RabbitMQ AMQP Client","0.0.0"},
>>  etc ...
>>  ]
>> (rabbit at iske)3> application:which_applications().
>> [{rabbitmq_shovel_management,"Shovel Status","0.0.0"},
>>  etc ...
>> ]
>>
>> If during any of these you get stuck, CTRL-C (and press the key for
>> 'abort') should get you back out again without incident.
>>
>>
>> - I can find no other errors in any logs that relate to rabbit or erlang
>> - Up until this point the cluster has been running fine for over 40 days.
>>
>> - telnet IP_ADDRESS 5672 times out
>>
>>
>> So the broker is no longer accepting new AMQP connections then.
>> Something's clearly quite wrong with this node.
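>>
>> It's worth distinguishing whether the listener is gone entirely or just
>> no longer accepting - something like this will show whether anything is
>> still bound to the port:
>>
>> netstat -tlnp | grep 5672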
>>
>> - I have not restarted the box, erlang node, or entire rabbitmq-server
>>
>> Is there anywhere else I can go looking for errors? I am about to start
>> killing processes, but I'm not sure that will solve anything.
>>
>>
>> Did you do that in the end? If not, I would really like to get to the
>> bottom of what's wrong with this node. I don't suppose it would be possible
>> for you to give us access to this machine would it? If necessary, we may be
>> able to get some kind of confidentiality agreement signed if that'd help.
>>
>> Cheers,
>>
>> Tim Watson
>> Staff Engineer
>> RabbitMQ
>>
>>
>>
>>
>>
>>
> <rabbit-manychannels.png>
>