[rabbitmq-discuss] Odd behavior where server stops responding

Wed Mar 12 16:45:13 GMT 2014

Hi Jason,

On 12 Mar 2014, at 14:32, Jason McIntosh wrote:

> I never could get any node up. Nothing showed up in any of the logs (startup, shutdown, regular, or sasl). One interesting thing: on the startup after killing all the processes (including epmd), it appears to have started multiple beams instead of the typical one.

Well, if there are rabbit (i.e., beam.smp) processes running, then you _do_ have a node up, though not necessarily responding properly. That's not the same as the program refusing to start though.

> 
> By non-responsive, rabbitmqctl shows the node as being down,

Could this be a file system corruption issue? Have you checked all the usual suspects for when rabbitmqctl won't connect to a node that you know is running, like checking that the Erlang cookies match?
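
For the cookie check, something like the following Python sketch would do. The two paths in the trailing comment are assumptions about a typical Linux package install (the broker's cookie under /var/lib/rabbitmq, the CLI user's cookie in their home directory), so adjust them for your setup. Stripping surrounding whitespace before comparing is my own choice, so that a stray newline does not show up as a mismatch.

```python
import hashlib
from pathlib import Path


def cookie_digest(path):
    """Return a SHA-256 hex digest of a cookie file, ignoring surrounding whitespace."""
    data = Path(path).read_bytes().strip()
    return hashlib.sha256(data).hexdigest()


def cookies_match(path_a, path_b):
    """True if both cookie files hold the same cookie value."""
    return cookie_digest(path_a) == cookie_digest(path_b)


# Example call (assumed default locations, run as a user that can read both):
#   cookies_match("/var/lib/rabbitmq/.erlang.cookie",
#                 Path.home() / ".erlang.cookie")
```

Comparing digests rather than printing the files keeps the cookie itself out of your terminal scrollback.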

> but I could telnet to both the management port and the rabbitmq port (which I'm guessing is the epmd process)

I'm not really sure what you mean by "the rabbitmq port", since we could be talking about several things here: (a) the port on which the broker accepts AMQP connections, (b) the port on which the broker accepts distributed Erlang connections (which the node registers with epmd, and which clients look up there), or something else. Note that epmd listens on its own port, which is separate from both of these.
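
To see which distribution port a node has registered, you can ask epmd directly; `epmd -names` does this from the shell. A rough Python equivalent is sketched below, assuming epmd is on its default port 4369 and using the NAMES_REQ request from the Erlang distribution protocol docs. The function names are mine, not part of any library.

```python
import socket
import struct

EPMD_PORT = 4369  # epmd's default listening port (assumed unchanged)


def parse_names_response(payload):
    """Parse an epmd names response body into {node_name: port}."""
    # The first 4 bytes are epmd's own port number; the rest is text,
    # one line per node: "name <node> at port <port>".
    nodes = {}
    for line in payload[4:].decode("utf-8", "replace").splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            nodes[parts[1]] = int(parts[4])
    return nodes


def epmd_names(host="localhost", timeout=5.0):
    """Ask epmd on `host` which nodes are registered and on which ports."""
    with socket.create_connection((host, EPMD_PORT), timeout=timeout) as sock:
        # 2-byte length prefix (1), then the request byte 110 ('n').
        sock.sendall(struct.pack(">HB", 1, 110))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # epmd closes the connection after replying
                break
            chunks.append(chunk)
    return parse_names_response(b"".join(chunks))
```

If the rabbit node does not appear in the result, rabbitmqctl has no way to find it, even when the beam.smp process is alive.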

> , but nothing shows in the log files for rabbit itself, nothing in the sasl logs, no content from the management port, etc.

Hmm, so you're sure (via ps and/or top) that there are beam.smp processes running, and you can see (via netstat) that the management HTTP port is in use, but there's no response from the HTTP (management) server?
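
The distinction I'm after is between a port that merely accepts TCP connections (which is all telnet shows you) and a server that actually answers. A minimal Python sketch of that check follows; `probe_http` is a name I made up, and the 15672 in the trailing comment assumes the management plugin's default port.

```python
import socket


def probe_http(host, port, timeout=5.0):
    """Classify an HTTP port as 'closed', 'silent' or 'responding'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "closed"  # refused, unreachable, or connect timed out
    with sock:
        try:
            request = b"GET / HTTP/1.0\r\nHost: " + host.encode("ascii") + b"\r\n\r\n"
            sock.sendall(request)
            data = sock.recv(64)
        except OSError:
            return "silent"  # connected, but no reply within the timeout
    # A live HTTP server starts its reply with a status line such as "HTTP/1.1 200 OK".
    return "responding" if data.startswith(b"HTTP/") else "silent"


# Example call (assumed default management port):
#   probe_http("localhost", 15672)
```

A result of "silent" would match what you describe: the kernel completes the TCP handshake, but the process behind the socket never gets around to answering.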

>  The OS was completely responsive - I could get to the shell, do most commands, though lsof (I think I mentioned this) wouldn't respond in any timely manner while rabbit server was running.

That _is_ very strange.

>  All the ways I know of to talk to rabbit and all the OS things I know to do were failing, and everything I know to try to restart it (killing the processes, waiting for network connections in TIME_WAIT to drain, killing epmd as well) failed.
> 
> At that point, I recycled each of the servers and they're back to a running state.  I don't know that I have a box handy, but I'll see what I can do to replicate this.  For some reason I think I've seen this before, and it's something with the OEL 6.2 kernel that Oracle put together, dealing with disk I/O flush on a journaled file system.  I'm pretty sure I saw the same thing about 6 months ago or so on a completely different set of servers.

Urgh, that sounds horrible. The more info you can provide us with the better. If you can replicate, that would be amazing since we can do the same thing and investigate.

Cheers,
Tim