[rabbitmq-discuss] Someone else with a nodedown error

Eric Berg eric.berg at pardot.com
Thu May 16 20:09:58 BST 2013


Thanks for the insight Tim. I ended up killing the the process before
receiving your email so I was unable to take advantage of your pro tips. We
subsequently upgraded our cluster to 3.1.0 on Erlang R15B03, and have run
into a slightly different problem. This appears to be unrelated, but I did
use your tips to gather as much information as possible.

Scenario:
A producer is sending messages to rabbit faster than they can be routed so
the connection goes under flow control. The flow control lasts long enough
so that our load balancer severs the connection. Before we understood that
the LB was our problem we would just retry socket write up to 10 times
before throwing an exception. In the end, our cluster now has 0 connections
yet it still has nearly 4000 channels (screen grab attached). Also, now I
am unable to stop_app on any node in the cluster without it hanging for, at
this point, over an hour.

Questions:
1. Should rabbit maintain channels without active connections? If not, has
anyone run into this before?
2. Is there any way to force kill these channels?
3. Is it a good practice for consumers to connect to rabbit via a load
balancer, or should that be for producers only? Are there any recommended
timeout settings?


Here are the results of some commands on the shell to the broker:


bash-4.1$ erl -sname debug -remsh rabbit at rabbit-box
{error_logger,{{2013,5,16},{14,44,36}},std_error,"File operation error:
eacces. Target: .. Function: read_file_info. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./beam_lib.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./ram_file.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./standard_error.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./supervisor_bridge.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./user_sup.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./user_drv.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./group.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./edlin.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./io_lib.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./proplists.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./io_lib_format.beam. Function: get_file. Process:
code_server."}
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:16:16]
[async-threads:0] [hipe] [kernel-poll:false]

{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./net.beam. Function: get_file. Process: code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./inet_gethost_native.beam. Function: get_file. Process:
code_server."}
{error_logger,{{2013,5,16},{14,44,37}},std_error,"File operation error:
eacces. Target: ./kernel_config.beam. Function: get_file. Process:
code_server."}

=ERROR REPORT==== 16-May-2013::10:44:36 ===
File operation error: eacces. Target: .. Function: read_file_info. Process:
code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./beam_lib.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./ram_file.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./standard_error.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./supervisor_bridge.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./user_sup.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./user_drv.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./group.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
File operation error: eacces. Target: ./edlin.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::10:44:37 ===
{lost_messages,6}

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./error_logger_tty_h.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./calendar.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./io_lib_pretty.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./io.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./erl_scan.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./c.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./erl_eval.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./orddict.beam. Function: get_file.
Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./file_io_server.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./erl_posix_msg.beam. Function:
get_file. Process: code_server.

=ERROR REPORT==== 16-May-2013::14:44:37 ===
file:path_eval([".","/var/lib/rabbitmq"],".erlang"): permission denied

=ERROR REPORT==== 16-May-2013::14:44:37 ===
File operation error: eacces. Target: ./dist_util.beam. Function: get_file.
Process: code_server.
Eshell V5.9.3.1  (abort with ^G)




-------------------------------------------------

(rabbit at rabbit-box)1> whereis(rabbit).
<0.251.0>


-------------------------------------------------

(rabbit at rabbit-box)2> application:loaded_applications().
[{public_key,"Public key infrastructure","0.17"},
 {rabbitmq_management_agent,"RabbitMQ Management Agent",
                            "3.1.0"},
 {os_mon,"CPO  CXC 138 46","2.2.10"},
 {amqp_client,"RabbitMQ AMQP Client","3.1.0"},
 {mnesia,"MNESIA  CXC 138 12","4.7.1"},
 {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.1.0"},
 {inets,"INETS  CXC 138 49","5.9.2"},
 {rabbitmq_federation,"RabbitMQ Federation","3.1.0"},
 {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.0"},
 {rabbitmq_federation_management,"RabbitMQ Federation Management",
                                 "3.1.0"},
 {rabbitmq_management_visualiser,"RabbitMQ Visualiser",
                                 "3.1.0"},
 {kernel,"ERTS  CXC 138 10","2.15.3"},
 {crypto,"CRYPTO version 2","2.2"},
 {ssl,"Erlang/OTP SSL application","5.1.2"},
 {sasl,"SASL  CXC 138 11","2.2.1"},
 {rabbitmq_management,"RabbitMQ Management Console","3.1.0"},
 {mochiweb,"MochiMedia Web Server",
           "2.3.1-rmq3.1.0-gitd541e9a"},
 {xmerl,"XML parser","1.3.2"},
 {rabbit,"RabbitMQ","3.1.0"},
 {stdlib,"ERTS  CXC 138 10","1.18.3"},
 {webmachine,"webmachine","1.9.1-rmq3.1.0-git52e62bc"}]


------------------------------------------------------------------------------------


(rabbit at rabbit-box)3> application:which_applications().
** exception exit: {timeout,{gen_server,call,

[application_controller,which_applications]}}
     in function  gen_server:call/2 (gen_server.erl, line 180)


-----------------------------------------------------------------------------------

(rabbit at rabbit-box)4> rabbit:stop().


-- This hangs indefinitely


------------------------------------------------------------------------------------

Perms on /var/lib/rabbitmq
drwxr-xr-x  3 rabbitmq rabbitmq 4096 May  1 07:15 rabbitmq




Thanks Tim et al.




On Thu, May 9, 2013 at 10:00 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hi,
>
> On 8 May 2013, at 18:22, Eric Berg wrote:
>
> I have ready through many of these nodedown error emails and of course
> none of them seem to be exactly what I am experiencing.
>
> I have a 4 node cluster, and one of the nodes went offline according to
> the cluster. This box has the following in the sasl log:
>
> =SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===
>      Supervisor: {<0.11197.1096>,
>                                            rabbit_channel_sup_sup}
>      Context:    shutdown_error
>      Reason:     noproc
>      Offender:   [{pid,<0.11199.1096>},
>                   {name,channel_sup},
>                   {mfa,{rabbit_channel_sup,start_link,[]}},
>                   {restart_type,temporary},
>                   {shutdown,infinity},
>                   {child_type,supervisor}]
>
>
> This simply indicates that and error occurred whilst a supervised process
> was shutting down. It's not indicative of the whole node going down -
> Erlang allows processes to crash and be restarted whilst the system is
> running.
>
> *Yet in the regular rabbit log i can see that it was still accepting
> connections up until 2:22AM the next day:*
>
> (last log entry)
> =INFO REPORT==== 8-May-2013::02:22:26 ===
> closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)
>
>
> So clearly that node didn't actually go offline. The 'nodedown' message in
> the other clustered broker's logs does not necessarily mean that the node
> in question crashed; This could, for example, be indicative of a net-split
> or other connectivity failure.
>
> *Running rabbitmqctl status returns:*
>
> [root at rabbit-box rabbitmq]# rabbitmqctl status
> Status of node 'rabbit at rabbit-box' ...
> Error: unable to connect to node 'rabbit at rabbit-box': nodedown
>
> DIAGNOSTICS
> ===========
>
> nodes in question: ['rabbit at rabbit-box']
>
> hosts, their running nodes and ports:
> - rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]
>
> current node details:
> - node name: 'rabbitmqctl2301 at rabbit-box'
> - home dir: /var/lib/rabbitmq
> - cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==
>
>
> Have you tried running this using `sudo' instead of as root? Is the
> rabbitmq user's account and home folder in a consistent state? The security
> cookie used for inter-node communications, which includes communication
> between the temporary `rabbitmqctl' node and the broker, has to be the same
> for all the peers.
>
> A couple of notes:
> - Looking for a process run by rabbit show that it appears to still be
> running
>
>
> Yes - as I said, there's no indication that this node actually died from
> what you've said. However `rabbitmqctl` should be able to connect to
> rabbit at rabbit-box at the very least.
>
> - Erlang cookie is the same on all nodes of the cluster, the cookie hash
> is the same as well
>
>
> If it's not the cookies then....
>
> - A traffic spike occurred right around the time of the last entry in the
> rabbit log
>
>
> It sounds like this could be a potential culprit. Can you provide any more
> information about what happened? It could be that whilst the network was
> saturated, the node in question got disconnected from the other nodes in
> the cluster because it exceeded the "net tick time" and subsequently things
> have started to go wrong. That shouldn't happen, viz the node should be
> able to re-establish connectivity, but it's possible that something's gone
> wrong here.
>
> What that doesn't explain is why you can't connect from rabbitmqctl. If
> you `su rabbitmq', can you then run `erl -sname debug -remsh
> rabbit at rabbit-box' to establish a shell into the running broker? If that
> does work, then you can stop the rabbit application and then the node, as
> follows:
>
> > rabbit:stop().
> ok
> > init:stop().
>
> But before you do, it might be worth evaluating a couple of other things
> that might help us identify what's going on:
>
> (rabbit at iske)1> whereis(rabbit).
> <0.152.0>
> (rabbit at iske)2> application:loaded_applications().
> [{os_mon,"CPO  CXC 138 46","2.2.9"},
>  {rabbitmq_management_agent,"RabbitMQ Management Agent",
>                             "0.0.0"},
>  {amqp_client,"RabbitMQ AMQP Client","0.0.0"},
>  etc ...
>  ]
> (rabbit at iske)3> application:which_applications().
> [{rabbitmq_shovel_management,"Shovel Status","0.0.0"},
>  etc ...
> ]
>
> If during any of these you get stuck, CTRL-C (and press the key for
> 'abort') should get you back out again without incident.
>
>
> - I can find no other errors in any logs that relate to rabbit or erlang
> - Up until this point the cluster has been running fine for over 40 days.
>
> - telnet IP_ADDRESS 5672 times out
>
>
> So the broker is no longer accepting new AMQP connections then.
> Something's clearly quite wrong with this node.
>
> - I have not restarted the box, erlang node, or entire rabbitmq-server
>
> Is there anywhere else I can go looking for errors? I am about to start
> killing processs, but Im not sure that will solve anything.
>
>
> Did you do that in the end? If not, I would really like to get to the
> bottom of what's wrong with this node. I don't suppose it would be possible
> for you to give us access to this machine would it? If necessary, we may be
> able to get some kind of confidentiality agreement signed if that'd help.
>
> Cheers,
>
> Tim Watson
> Staff Engineer
> RabbitMQ
>
>
>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20130516/31b6b51a/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rabbit-manychannels.png
Type: image/png
Size: 127694 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20130516/31b6b51a/attachment.png>


More information about the rabbitmq-discuss mailing list