[rabbitmq-discuss] Someone else with a nodedown error

Thu May 9 15:00:30 BST 2013

Hi,

On 8 May 2013, at 18:22, Eric Berg wrote:

> I have read through many of these nodedown error emails and of course none of them seem to be exactly what I am experiencing.
> 
> I have a 4 node cluster, and one of the nodes went offline according to the cluster. This box has the following in the sasl log:
> 
> =SUPERVISOR REPORT==== 7-May-2013::14:37:22 ===
>      Supervisor: {<0.11197.1096>,
>                                            rabbit_channel_sup_sup}
>      Context:    shutdown_error
>      Reason:     noproc
>      Offender:   [{pid,<0.11199.1096>},
>                   {name,channel_sup},
>                   {mfa,{rabbit_channel_sup,start_link,[]}},
>                   {restart_type,temporary},
>                   {shutdown,infinity},
>                   {child_type,supervisor}]
> 

This simply indicates that an error occurred whilst a supervised process was shutting down. It's not indicative of the whole node going down - Erlang allows processes to crash and be restarted whilst the system is running.

> Yet in the regular rabbit log i can see that it was still accepting connections up until 2:22AM the next day:
> 
> (last log entry)
> =INFO REPORT==== 8-May-2013::02:22:26 ===
> closing AMQP connection <0.18267.1145> (IPADDRESS:PORT -> IPADDRESS:PORT)
> 

So clearly that node didn't actually go offline. The 'nodedown' message in the other clustered brokers' logs does not necessarily mean that the node in question crashed; it could, for example, be indicative of a net-split or other connectivity failure.

> Running rabbitmqctl status returns:
> 
> [root@rabbit-box rabbitmq]# rabbitmqctl status
> Status of node 'rabbit@rabbit-box' ...
> Error: unable to connect to node 'rabbit@rabbit-box': nodedown
> 
> DIAGNOSTICS
> ===========
> 
> nodes in question: ['rabbit@rabbit-box']
> 
> hosts, their running nodes and ports:
> - rabbit-box: [{rabbit,13957},{rabbitmqctl2301,16508}]
> 
> current node details:
> - node name: 'rabbitmqctl2301@rabbit-box'
> - home dir: /var/lib/rabbitmq
> - cookie hash: qQwyFW90ZNbbrFvX1AtrxQ==

Have you tried running this using `sudo' instead of as root? Is the rabbitmq user's account and home folder in a consistent state? The security cookie used for inter-node communications, which includes communication between the temporary `rabbitmqctl' node and the broker, has to be the same for all the peers.
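
For comparing cookies without pasting them into an email, here is a minimal sketch that reproduces the "cookie hash" value from the diagnostics. It assumes that rabbitmqctl prints the base64-encoded MD5 digest of the cookie string, and that the cookie is the content of the `.erlang.cookie' file without its trailing newline; both are assumptions worth checking against your release. The function names are made up for illustration.

```python
import base64
import hashlib


def cookie_hash(cookie):
    """Base64 of the MD5 digest of the cookie string (assumed hash format)."""
    digest = hashlib.md5(cookie.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def cookie_hash_from_file(path):
    """Hash the cookie stored in a file such as ~/.erlang.cookie.

    Assumption: surrounding whitespace (e.g. a trailing newline) is not
    part of the cookie.
    """
    with open(path, "r", encoding="utf-8") as fh:
        return cookie_hash(fh.read().strip())
```

Running `cookie_hash_from_file' against /var/lib/rabbitmq/.erlang.cookie and against the cookie file of whichever user invokes rabbitmqctl should give the same string on every node if the cookies really do match.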

> A couple of notes:
> - Looking for a process run by rabbit show that it appears to still be running

Yes - as I said, nothing you've described indicates that this node actually died. However `rabbitmqctl` should be able to connect to rabbit@rabbit-box at the very least.

> - Erlang cookie is the same on all nodes of the cluster, the cookie hash is the same as well

If it's not the cookies then....

> - A traffic spike occurred right around the time of the last entry in the rabbit log

It sounds like this could be a potential culprit. Can you provide any more information about what happened? It could be that whilst the network was saturated, the node in question got disconnected from the other nodes in the cluster because it exceeded the "net tick time", and that things subsequently started to go wrong. That shouldn't happen, i.e. the node should be able to re-establish connectivity, but it's possible that something's gone wrong here.
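
To put a number on "net tick time": the Erlang kernel documentation states that an unresponsive peer is detected within T seconds, where 0.75 * net_ticktime < T < 1.25 * net_ticktime, and that the default net_ticktime is 60 seconds. A small sketch of that arithmetic follows (the function name is made up; whether your nodes run with the default is an assumption):

```python
def nodedown_window(net_ticktime=60.0):
    """Return (min, max) seconds before a silent peer is reported down.

    Follows the bounds given in the Erlang kernel documentation:
    0.75 * net_ticktime < T < 1.25 * net_ticktime.
    """
    return (0.75 * net_ticktime, 1.25 * net_ticktime)
```

With the default setting that works out to somewhere between 45 and 75 seconds of silence on the distribution link, which a saturated network could plausibly cause.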

What that doesn't explain is why you can't connect from rabbitmqctl. If you `su rabbitmq', can you then run `erl -sname debug -remsh rabbit@rabbit-box' to establish a shell into the running broker? If that does work, then you can stop the rabbit application and then the node, as follows:

> rabbit:stop().
ok
> init:stop().

But before you do, it might be worth evaluating a couple of other things that might help us identify what's going on:

(rabbit@iske)1> whereis(rabbit).
<0.152.0>
(rabbit@iske)2> application:loaded_applications().
[{os_mon,"CPO  CXC 138 46","2.2.9"},
 {rabbitmq_management_agent,"RabbitMQ Management Agent",
                            "0.0.0"},
 {amqp_client,"RabbitMQ AMQP Client","0.0.0"},
 etc ...
 ]
(rabbit@iske)3> application:which_applications().
[{rabbitmq_shovel_management,"Shovel Status","0.0.0"},
 etc ...
]

If during any of these you get stuck, CTRL-C (and press the key for 'abort') should get you back out again without incident.

> - I can find no other errors in any logs that relate to rabbit or erlang
> - Up until this point the cluster has been running fine for over 40 days.
> - telnet IP_ADDRESS 5672 times out

So the broker is no longer accepting new AMQP connections then. Something's clearly quite wrong with this node.
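
It may also be worth checking the other listeners on that box, since a timeout, a refusal and a successful connect each point at different problems. Below is a minimal sketch of such a probe. The `probe' helper is hypothetical, and the ports to try are 5672 (AMQP, from your telnet test), 13957 (the distribution port that the diagnostics above report for `rabbit') and 4369 (the default epmd port).

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout' or 'error: <reason>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (or a firewall sent a reset).
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets dropped, or the
        # listener is not accepting connections.
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks and so on.
        return f"error: {exc}"
```

For example, `probe("rabbit-box", 13957)' run from another cluster member would show whether the distribution port is still reachable from the rest of the cluster.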

> - I have not restarted the box, erlang node, or entire rabbitmq-server
> 
> Is there anywhere else I can go looking for errors? I am about to start killing processes, but I'm not sure that will solve anything.
> 

Did you do that in the end? If not, I would really like to get to the bottom of what's wrong with this node. I don't suppose it would be possible for you to give us access to this machine, would it? If necessary, we may be able to get some kind of confidentiality agreement signed if that'd help.

Cheers,

Tim Watson
Staff Engineer
RabbitMQ
