<div dir="ltr">I have read through many of these nodedown error emails, and of course none of them seem to be exactly what I am experiencing.<div><br></div><div style>I have a 4-node cluster, and one of the nodes went offline according to the cluster. This box has the following in the sasl log:</div>
<div style><br></div><div style><div>=SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===</div><div> Supervisor: {<0.11197.1096>,</div><div> rabbit_channel_sup_sup}</div><div>
Context: shutdown_error</div><div> Reason: noproc</div><div> Offender: [{pid,<0.11199.1096>},</div><div> {name,channel_sup},</div><div> {mfa,{rabbit_channel_sup,start_link,[]}},</div>
<div> {restart_type,temporary},</div><div> {shutdown,infinity},</div><div> {child_type,supervisor}]</div><div><br></div><div style><b>Yet in the regular rabbit log I can see that it was still accepting connections up until 2:22 AM the next day:</b></div>
<div style><br></div><div style>(last log entry)</div><div style><div>=INFO REPORT==== 8-May-2013::02:22:26 ===</div><div>closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)</div><div><br></div>
<div style><b>Running rabbitmqctl status returns:</b></div><div style><br></div><div style><div>[root@rabbit-box rabbitmq]# rabbitmqctl status</div><div>Status of node 'rabbit@rabbit-box' ...</div><div>Error: unable to connect to node 'rabbit@rabbit-box': nodedown</div>
<div><br></div><div>DIAGNOSTICS</div><div>===========</div><div><br></div><div>nodes in question: ['rabbit@rabbit-box']</div><div><br></div><div>hosts, their running nodes and ports:</div><div>- rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]</div>
<div><br></div><div>current node details:</div><div>- node name: 'rabbitmqctl2301@rabbit-box'</div><div>- home dir: /var/lib/rabbitmq</div><div>- cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==</div><div><br></div><div><br>
</div><div style>A couple of notes:</div><div style>- Looking for a process run by rabbit shows that it appears to still be running</div><div style>- The Erlang cookie is the same on all nodes of the cluster, and the cookie hash is the same as well</div>
<div style>- A traffic spike occurred right around the time of the last entry in the rabbit log</div><div style>- I can find no other errors in any logs that relate to rabbit or erlang</div><div style>- Up until this point the cluster has been running fine for over 40 days.</div>
<div style>- telnet IP_ADDRESS 5672 times out</div><div style>- I have not restarted the box, erlang node, or entire rabbitmq-server</div><div style><br></div><div style>Is there anywhere else I can go looking for errors? I am about to start killing processes, but I'm not sure that will solve anything.</div>
<div style><br></div><div style>Thanks!</div><div style><br></div><div style>Eric Berg</div><div style><br></div><div><br></div><div><br></div></div><div><br></div></div></div></div>