<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi,<div><br><div><div>On 8 May 2013, at 18:22, Eric Berg wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">I have ready through many of these nodedown error emails and of course none of them seem to be exactly what I am experiencing.<div><br></div><div style="">I have a 4 node cluster, and one of the nodes went offline according to the cluster. This box has the following in the sasl log:</div>
<div style=""><br></div><div style=""><div>=SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===</div><div> Supervisor: {<0.11197.1096>,</div><div> rabbit_channel_sup_sup}</div><div>
Context: shutdown_error</div><div> Reason: noproc</div><div> Offender: [{pid,<0.11199.1096>},</div><div> {name,channel_sup},</div><div> {mfa,{rabbit_channel_sup,start_link,[]}},</div>
<div> {restart_type,temporary},</div><div> {shutdown,infinity},</div><div> {child_type,supervisor}]</div><div><br></div></div></div></blockquote><div><br></div><div>This simply indicates that an error occurred whilst a supervised process was shutting down. It's not indicative of the whole node going down - Erlang allows processes to crash and be restarted whilst the system is running.</div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><b>Yet in the regular rabbit log i can see that it was still accepting connections up until 2:22AM the next day:</b></div>
<div style=""><br></div><div style="">(last log entry)</div><div style=""><div>=INFO REPORT==== 8-May-2013::02:22:26 ===</div><div>closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)</div><div><br></div></div></div></div></blockquote><div><br></div><div>So clearly that node didn't actually go offline. The 'nodedown' message in the other clustered brokers' logs does not necessarily mean that the node in question crashed; it could, for example, indicate a net-split or other connectivity failure. </div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style="">
<div style=""><b>Running rabbitmqctl status returns:</b></div><div style=""><br></div><div style=""><div>[root@rabbit-box rabbitmq]# rabbitmqctl status</div><div>Status of node 'rabbit@rabbit-box' ...</div><div>Error: unable to connect to node 'rabbit@rabbit-box': nodedown</div>
<div><br></div><div>DIAGNOSTICS</div><div>===========</div><div><br></div><div>nodes in question: ['rabbit@rabbit-box']</div><div><br></div><div>hosts, their running nodes and ports:</div><div>- rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]</div>
<div><br></div><div>current node details:</div><div>- node name: 'rabbitmqctl2301@rabbit-box'</div><div>- home dir: /var/lib/rabbitmq</div><div>- cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==</div></div></div></div></div></blockquote><div><br></div><div>Have you tried running this using `sudo' instead of as root? Are the rabbitmq user's account and home folder in a consistent state? The security cookie used for inter-node communications, which includes communication between the temporary `rabbitmqctl' node and the broker, has to be the same for all the peers.</div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">A couple of notes:</div><div style="">- Looking for a process run by rabbit show that it appears to still be running</div></div></div></div></div></blockquote><div><br></div><div>Yes - as I said, nothing you've described indicates that this node actually died. However, `rabbitmqctl` should be able to connect to rabbit@rabbit-box at the very least. </div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">- Erlang cookie is the same on all nodes of the cluster, the cookie hash is the same as well</div></div></div></div></div></blockquote><div><br></div><div>If it's not the cookies then....</div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">- A traffic spike occurred right around the time of the last entry in the rabbit log</div></div></div></div></div></blockquote><div><br></div><div>This sounds like a potential culprit. Can you provide any more information about what happened? It could be that whilst the network was saturated, the node in question got disconnected from the other nodes in the cluster because it exceeded the "net tick time" and subsequently things have started to go wrong. 
That shouldn't happen, i.e. the node should be able to re-establish connectivity, but it's possible that something's gone wrong here.</div><div><br></div><div>What that doesn't explain is why you can't connect from rabbitmqctl. If you `su rabbitmq', can you then run `erl -sname debug -remsh rabbit@rabbit-box' to establish a shell into the running broker? If that does work, then you can stop the rabbit application and then the node, as follows:</div><div><br></div><div>> rabbit:stop().</div><div>ok</div><div>> init:stop().</div><div><br></div><div>But before you do, it would be worth evaluating a couple of other expressions that might help us identify what's going on:</div><div><br></div><div>(rabbit@iske)1> whereis(rabbit).</div><div><0.152.0></div><div>(rabbit@iske)2> application:loaded_applications().</div><div>[{os_mon,"CPO CXC 138 46","2.2.9"},</div><div> {rabbitmq_management_agent,"RabbitMQ Management Agent",</div><div> "0.0.0"},</div><div> {amqp_client,"RabbitMQ AMQP Client","0.0.0"},</div><div> etc ...</div><div> ]</div><div>(rabbit@iske)3> application:which_applications(). </div><div>[{rabbitmq_shovel_management,"Shovel Status","0.0.0"},</div><div> etc ...</div><div>]</div><div> </div><div>If you get stuck during any of these, pressing CTRL-C and then the key for 'abort' should get you back out again without incident.</div><div><br></div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">- I can find no other errors in any logs that relate to rabbit or erlang</div><div style="">- Up until this point the cluster has been running fine for over 40 days.</div></div></div></div></div></blockquote><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">- telnet IP_ADDRESS 5672 times out</div></div></div></div></div></blockquote><div><br></div><div>So the broker is no longer accepting new AMQP connections then. 
Something's clearly quite wrong with this node.</div><br><blockquote type="cite"><div dir="ltr"><div style=""><div style=""><div style=""><div style="">- I have not restarted the box, erlang node, or entire rabbitmq-server</div><div style=""><br></div><div style="">Is there anywhere else I can go looking for errors? I am about to start killing processs, but Im not sure that will solve anything.</div>
<div style=""><br></div></div></div></div></div></blockquote><div><br></div><div>Did you do that in the end? If not, I would really like to get to the bottom of what's wrong with this node. I don't suppose it would be possible for you to give us access to this machine, would it? If it would help, we may be able to get some kind of confidentiality agreement signed.</div><div><br></div><div>Cheers,</div><div><br></div><div>Tim Watson</div><div>Staff Engineer</div><div>RabbitMQ</div><div><br></div><div><br></div><div><br></div></div></div></body></html>