[rabbitmq-discuss] Odd behavior where server stops responding

Jason McIntosh mcintoshj at gmail.com
Wed Mar 12 17:23:59 GMT 2014


The last time I saw this, I found an article describing how the OEL 6.2
kernels, on servers with large amounts of RAM, would let a lot of dirty data
build up in the buffer cache/journal before flushing to disk, which caused
long IO waits and could effectively take down the OS or other processes.  We
just saw this on a completely different server the other morning, one not
running rabbitmq (all kinds of messages in the logs about tasks blocked for
120 seconds, ssh stops responding, you have to get into the iLO or VMware
console to see any of it).  In those cases the whole OS stopped responding,
though, so it may not be related to the issues I saw on these three servers.
But it's why I suspect a kernel/disk IO problem here, particularly since lsof
wasn't working and the software RAID checks show up in the logs.  Note that
when I DID kill rabbit and epmd, I checked lsof against both the /data
partition where the mnesia database lives and /var/log, looking for any sign
of a problem, and it worked fine once rabbit wasn't running.
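
For reference, the checks were roughly these (paths as mounted on these
boxes):

lsof /data        # filesystem holding the mnesia dir - wouldn't return while rabbit was up
lsof +D /var/log  # recursive check of the log directory - same story, fine once rabbit was down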

When I say "refusing to start", I mean the init script never completed and
exited.  It started the process, but that process never responded (my guess),
or the init script would have finished whatever it does.  I wish I still had
the ps aufx output I captured when I saw this - it looked fairly normal apart
from a few things missing.  Here's what we have now:
root      2484  0.0  0.0 108488  1116 ?        S    Mar11   0:00 /bin/sh /etc/rc3.d/S80rabbitmq-server start
root      2490  0.0  0.0 108216  1444 ?        S    Mar11   0:00  \_ /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/sbin/rabbitmq-server
root      2493  0.0  0.0 108216  1488 ?        S    Mar11   0:00      \_ /bin/sh /usr/sbin/rabbitmq-server
root      2506  0.0  0.0 163228  2160 ?        S    Mar11   0:00          \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq  2511  5.9  0.0 3881272 79464 ?       Ssl  Mar11  87:11             \_ /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -h
rabbitmq  2676  0.0  0.0  10844   540 ?        Ss   Mar11   0:00                \_ inet_gethost 4
rabbitmq  2677  0.0  0.0  12948   712 ?        S    Mar11   0:00                    \_ inet_gethost 4

When I was having the issues, I saw duplicates of the beam.smp processes
running, and what appeared to be duplicate start commands as well, but no
inet_gethost 4 processes.  A netstat showed nothing on 5672 or 15672 - not
listening, not connected, not in any other state.
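
In shell terms, roughly:

netstat -antp | egrep ':(5672|15672)'   # came back empty - nothing listening or connected
epmd -names                             # worth capturing next time too - what epmd thinks is registered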

By "telneting to the port" I mean literally I did these:
telnet <server> 5672
telnet <server> 15672
Enter a bunch of garbage in both cases, nothing happens.  At the very least
on the management port, typing "GET /" should have given a response with a
request for authorization or similar.  And yeah before doing this, beam and
the epmd process were running.
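
For a slightly less crude probe than raw telnet, something along these lines
should separate a dead broker from a merely slow one (this assumes the stock
ports and that a usable account like guest exists):

printf 'AMQP\x00\x00\x09\x01' | nc -w 5 <server> 5672 | od -c | head
    # a live broker answers the AMQP 0-9-1 protocol header with a Connection.Start frame
curl -i -m 5 -u guest:guest http://<server>:15672/api/overview
    # a live management plugin answers with HTTP, even if it's only a 401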

Please note these servers are NOT clustered together - they're completely
independent.  They all just failed around the same time (within about a
four-hour window), and they'd all been started around the same time back in
August.

Next time I see this I'll try to get an strace on the beam process and see if
I can get more information (rough checklist further down).  As I said -
there's definitely SOMETHING screwy here, and the way it behaved felt very
much like an OS-level lock of some kind - a semaphore lock or something
similar - that rabbit (well, erlang) didn't handle very well.  The only thing
I found even remotely suspicious was the md check in the syslog:

Mar  9 03:22:02 mongobdc14p kernel: imklog 4.6.2, log source = /proc/kmsg
started.
Mar  9 03:22:02 mongobdc14p rsyslogd: [origin software="rsyslogd"
swVersion="4.6.2" x-pid="5815" x-info="http://www.rsyslog.com"] (re)start
Mar  9 04:48:19 mongobdc14p kernel: md: md6: data-check done.
Mar  9 04:48:19 mongobdc14p kernel: md: data-check of RAID array md5
Mar  9 04:48:19 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Mar  9 04:48:19 mongobdc14p kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 04:48:19 mongobdc14p kernel: md: using 128k window, over a total of
104856508 blocks.
Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md0 until
md5 has finished (they share one or more physical units)
Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md1 until
md5 has finished (they share one or more physical units)
Mar  9 05:03:45 mongobdc14p kernel: md: md5: data-check done.
Mar  9 05:03:45 mongobdc14p kernel: md: data-check of RAID array md0
Mar  9 05:03:45 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Mar  9 05:03:45 mongobdc14p kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 05:03:45 mongobdc14p kernel: md: using 128k window, over a total of
524276 blocks.
Mar  9 05:03:45 mongobdc14p kernel: md: delaying data-check of md1 until
md0 has finished (they share one or more physical units)
Mar  9 05:03:50 mongobdc14p kernel: md: md0: data-check done.
Mar  9 05:03:50 mongobdc14p kernel: md: data-check of RAID array md1
Mar  9 05:03:50 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Mar  9 05:03:50 mongobdc14p kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Mar  9 05:03:50 mongobdc14p kernel: md: using 128k window, over a total of
31456188 blocks.
Mar  9 05:08:53 mongobdc14p kernel: md: md1: data-check done.
Mar 10 14:40:01 mongobdc14p auditd[21812]: Audit daemon rotating log files
Mar 11 11:46:46 mongobdc14p init: Failed to spawn splash-manager main
process: unable to open console: Input/output error

(11:46 was when I finally gave up trying to debug and just restarted the
whole server, so you can ignore the IO error on the splash-manager.)  Cron
jobs were all running fine; no problems anywhere that I could actually see.
Before things stopped responding on the 11th, rabbit used a few more file
descriptors, then stopped responding around 10am.  Memory was a bit spiky but
nothing out of the ordinary.  Disk free hadn't changed to speak of (a few MB
up and down).
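
For next time, the rough checklist I'm planning to grab before bouncing
anything (the beam pid below is a placeholder):

strace -f -tt -p <beam_pid> -o /tmp/beam.strace  # what the emulator threads are blocked on
cat /proc/<beam_pid>/status                      # process state, threads, memory
dmesg | tail -50                                 # any "blocked for more than 120 seconds" traces
cat /proc/mdstat                                 # whether an md data-check/resync is in flight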

For reference, the full beam.smp command line (from ps) on the rebuilt box:

9.1 rabbitmq /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30
  -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq
  -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../ebin
  -noshell -noinput -s rabbit boot -sname rabbit@mongobdc14p -boot start_sasl
  -kernel inet_default_connect_options [{nodelay,true}]
  -sasl errlog_type error -sasl sasl_error_logger false
  -rabbit error_logger {file,"/data/rabbitmq/rabbit@mongobdc14p.log"}
  -rabbit sasl_error_logger {file,"/data/rabbitmq/rabbit@mongobdc14p-sasl.log"}
  -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins"
  -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../plugins"
  -rabbit plugins_expand_dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p-plugins-expand"
  -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false
  -mnesia dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p"

If you'd like, I've got more graphs and data I can share (I had the zabbix
agent on that box running the zabbix-rabbitmq monitoring stuff I wrote), but
I really haven't been able to find anything wrong.  The management API just
stopped responding at 10:32, kernel context switches suddenly dropped, and
it's like everything simply stopped doing anything.
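
Those graphs come from polling the management HTTP API, more or less along
these lines (the monitoring user and password are placeholders):

curl -s -m 10 -u <monitor_user>:<password> http://localhost:15672/api/nodes
    # this is the kind of call that went quiet at 10:32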

At this point I'd chalk this up as an "it's an anomaly that shouldn't require
a whole lot of investigation" kind of situation, so don't worry about it too
heavily.  Or blame Oracle - I like blaming Oracle :)
Jason

On Wed, Mar 12, 2014 at 11:45 AM, Tim Watson <tim at rabbitmq.com> wrote:

> Hi Jason,
>
> On 12 Mar 2014, at 14:32, Jason McIntosh wrote:
>
> > I never could get any node up - nothing showed up in the logs for
> startup, shutdown, regular and sasl logs.  One interesting thing is on the
> startup after killing all the processes (including epmd) it appears to have
> started multiple beams instead of the typical one.
>
> Well, if there are rabbit (i.e., beam.smp) processes running, then you
> _do_ have a node up, though not necessarily responding properly. That's not
> the same as the program refusing to start though.
>
> >
> > By non-responsive, rabbitmqctl shows the node as being down,
>
> Could this be a file system corruption issue? Have you checked all the
> usual suspects for when rabbitmqctl won't connect to a node that you know
> is running, like checking the erlang cookies match?
>
> > but I could telnet to both the management port and the rabbitmq port
> (which I'm guessing is the epmd process)
>
> I'm not really sure what you mean by "the rabbitmq port", but we could be
> talking about several things here. (a) the port on which the broker accepts
> AMQP connections, (b) the port on which the broker accepts distributed
> erlang connections (which is assigned by epmd) or something else (?) so I'm
> not really sure what we're saying here.
>
> > , but nothing shows in the log files for rabbit itself, nothing in the
> sasl logs, no content from the management port, etc.
>
> Hmn, so you're sure (via ps and/or top) that there are beam.smp processes
> running, and you can see (via netstat) that the management HTTP port is in
> use, but there's no response from the HTTP (management) server?
>
> >  The OS was completely responsive - I could get to the shell, do most
> commands, though lsof (I think I mentioned this) wouldn't respond in any
> timely manner while rabbit server was running.
>
> That _is_ very strange.
>
> >  All the ways I know of to talk to rabbit and all the os things i know
> to do were failing and all the things I know to try and restart it (killing
> the processes, waiting for network connections in TIME_WAIT to drain,
> killing EPMD as well) all failed.
> >
> > At that point, I recycled each of the servers and they're back to a
> running state.  I don't know that I have a box handy but I'll see what I
> can do to replicate this.  For some reason I think I've seen this before
> and it's something with the OEL 6.2 kernel that Oracle put together,
> dealing with disk io flush on a journaled file system.  I'm pretty sure
> I've seen the same thing about 6 months ago or so on a completely different
> set of servers.
>
> Urgh, that sounds horrible. The more info you can provide us with the
> better. If you can replicate, that would be amazing since we can do the
> same thing and investigate.
>
> Cheers,
> Tim
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>



-- 
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612