[rabbitmq-discuss] Odd behavior where server stops responding

Wed Mar 12 14:32:42 GMT 2014

I never could get any node up - nothing showed up in the startup, shutdown,
regular, or sasl logs.  One interesting thing is that on the startup after
killing all the processes (including epmd), it appears to have started
multiple beam processes instead of the typical one.

By non-responsive, I mean that rabbitmqctl shows the node as being down, but
I could telnet to both the management port and the rabbitmq port (which I'm
guessing is the epmd process).  Nothing shows in the log files for rabbit
itself, nothing in the sasl logs, no content from the management port, etc.
The OS was completely responsive - I could get to the shell and run most
commands, though lsof (I think I mentioned this) wouldn't respond in any
timely manner while the rabbit server was running.  All the ways I know of
to talk to rabbit and all the OS things I know to do were failing, and all
the things I know to try to restart it (killing the processes, waiting for
network connections in TIME_WAIT to drain, killing epmd as well) failed.
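
The symptom described above (ports accept TCP connections while rabbitmqctl
reports the node as down) can be checked with a small script.  The following
is a minimal sketch and not part of the original report: the port numbers are
assumed defaults (4369 for epmd, 5672 for AMQP, 15672 for the management
plugin), and the function names are hypothetical.

```python
import socket

# Assumed default ports; the original message does not say which were used.
DEFAULT_PORTS = {"epmd": 4369, "amqp": 5672, "management": 15672}


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host="localhost", ports=DEFAULT_PORTS, timeout=2.0):
    """Map each service name to whether its port accepted a connection."""
    return {
        name: port_accepts_connections(host, port, timeout)
        for name, port in ports.items()
    }


if __name__ == "__main__":
    for name, is_open in probe().items():
        state = "accepting connections" if is_open else "not reachable"
        print(f"{name}: {state}")
```

A successful connect only shows that the kernel completed the TCP handshake
on a listening socket.  It does not show that the beam process is servicing
the connection, so open ports are consistent with a hung broker.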

At that point, I recycled each of the servers and they're back to a running
state.  I don't know that I have a box handy, but I'll see what I can do to
replicate this.  For some reason I think I've seen this before and it's
something with the OEL 6.2 kernel that Oracle put together, dealing with
disk I/O flushes on a journaled file system.  I'm pretty sure I saw the
same thing about 6 months ago on a completely different set of servers.

Jason

On Wed, Mar 12, 2014 at 4:39 AM, Tim Watson <tim at rabbitmq.com> wrote:

>
> On 11 Mar 2014, at 16:41, Jason McIntosh wrote:
>
> > This may be a 3.0.4-1 issue (or erlang esl-erlang-R15B03-2.x86_64) and I
> > just may need to upgrade rabbit but I thought I'd see if anyone had seen
> > this before.  This is on oracle enterprise linux 6.2.
> >
> > A few days after a software raid filesystem check (may or may not be
> > related, only thing I can see that's common), on three separate servers,
> > rabbit just completely hung.  I couldn't even do an lsof on any file
> > system.
>
> That sounds like the OS hung not rabbit?
>
> > The port was still responsive, but the rabbit process itself was
> > completely hung.  CPU use jumped up to 25% on the beam process and just
> > stayed there...
>
> 25% isn't exactly massive, though if the spike in CPU wasn't associated
> with an increase in messaging traffic then something could be wrong.
>
> > I killed all the rabbit processes and tried to restart them.  There was
> > nothing in the logs and the startup failed
>
> How did it fail? Non-zero return code(s) or more?
>
> > I killed everything again, including the EPMD process and then rabbit
> > was finally able to start.  Within a few moments though the beam hung
> > again - I see a few connections show in the logs and then the process is
> > non-responsive.
>
> How did the beam process hang and in what way did the broker become
> unresponsive? How are you trying to interact with the broker? Are we
> talking about rabbitmqctl commands not responding, or something else?
>
> >
> > I'm GUESSING this is an OS level issue, and I'd swear I've seen this
> > before.
>
> I've not heard of this before either. If you can get a rabbit node up and
> running then we may be able to run some diagnostic commands, depending on
> whether or not rabbitmqctl commands are working (or not). There are other
> ways to access a console on a running node too, though they're a bit more
> involved.
>
> > I've had to do a full restart of the server to get things back to a
> > decent state.  Anyone have any advice/ideas?
> >
>
> Please clarify what we mean by hung and non-responsive. Also, do you now
> have the broker running without any of these issues? Can you fire up a
> standalone (test) node in a VM or some such, run the broker (successfully)
> on it and re-run the raid check to see if it causes the same problem?
>
> It would be good to get to the bottom of this, though clearly without any
> logs it's going to be tricky.
>
> Cheers,
> Tim

-- 
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612