[rabbitmq-discuss] Odd behavior where server stops responding

Wed Mar 12 09:39:00 GMT 2014

On 11 Mar 2014, at 16:41, Jason McIntosh wrote:

> This may be a 3.0.4-1 issue (or Erlang esl-erlang-R15B03-2.x86_64), and I may just need to upgrade rabbit, but I thought I'd see if anyone had seen this before. This is on Oracle Enterprise Linux 6.2.
> 
> A few days after a software RAID filesystem check (may or may not be related, it's the only common factor I can see), on three separate servers, rabbit just completely hung. I couldn't even run lsof on any file system.

That sounds like the OS hung, not rabbit?

> The port was still responsive, but the rabbit process itself was completely hung.  CPU use jumped up to 25% on the beam process and just stayed there...

25% isn't exactly massive, though if the spike in CPU wasn't associated with an increase in messaging traffic then something could be wrong.

> I killed all the rabbit processes and tried to restart them. There was nothing in the logs and the startup failed.

How did it fail? Non-zero return code(s) or more?

> I killed everything again, including the EPMD process, and then rabbit was finally able to start. Within a few moments though the beam hung again - I see a few connections show in the logs and then the process is non-responsive.

How did the beam process hang and in what way did the broker become unresponsive? How are you trying to interact with the broker? Are we talking about rabbitmqctl commands not responding, or something else?
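
To tell "the port still accepts connections" apart from "the broker actually answers", something along these lines might help. It is only a rough sketch: it assumes the broker listens on the default AMQP port 5672 on localhost, that rabbitmqctl is on the PATH of a user allowed to run it, and the helper names and timeouts are made up for illustration.

```python
import socket
import subprocess


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False


def command_responds(argv, timeout=10.0):
    """Run argv and return (status, output).

    status is 'ok' (exit code 0), 'failed' (non-zero exit code),
    'timeout' (no answer in time) or 'unavailable' (could not be started).
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    except OSError:
        return "unavailable", ""
    return ("ok" if result.returncode == 0 else "failed"), result.stdout


if __name__ == "__main__":
    # Assumed defaults: local broker on the standard AMQP port.
    print("AMQP port open:", port_accepts_connections("localhost", 5672))
    status, output = command_responds(["rabbitmqctl", "status"])
    print("rabbitmqctl status:", status)
    print(output)
```

If the port check succeeds but the rabbitmqctl call reports 'timeout', that would suggest the node is accepting connections at the OS level while the Erlang VM itself is not answering.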

> 
> I'm GUESSING this is an OS level issue, and I'd swear I've seen this before.

I've not heard of this before. If you can get a rabbit node up and running then we may be able to run some diagnostic commands, depending on whether rabbitmqctl commands are working. There are other ways to access a console on a running node too, though they're a bit more involved.

>  I've had to do a full restart of the server to get things back to a decent state.  Anyone have any advice/ideas?
> 

Please clarify what you mean by hung and non-responsive. Also, do you now have the broker running without any of these issues? Can you fire up a standalone (test) node in a VM or some such, run the broker (successfully) on it and re-run the RAID check to see if it causes the same problem?

It would be good to get to the bottom of this, though clearly without any logs it's going to be tricky.

Cheers,
Tim