<div dir="ltr">The last time I saw this, I found an article describing how the OEL 6.2 kernels, on servers with large amounts of RAM, could build up a large buffer cache for the journal before flushing it to disk, which would cause long IO times and potentially crash the OS or other processes.  We saw this just the other morning on a completely different server that was not running rabbitmq (all kinds of messages in the logs about 120 second wait times, ssh stops responding, and you have to get into the iLO or VMware console to see them).  In those cases the OS itself stopped responding, though, so it might not be related to the issues I saw on these three servers.  Still, this is why I suspect an issue with the kernel and disk IO, particularly since lsof wasn't working and the software RAID checks show up in the logs.  Note that when I DID kill rabbit and epmd, I ran lsof against both the /data partition, where I have the mnesia database, and /var/log, looking for any sign of a problem, and it worked while rabbit wasn't running.<div>
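<div>A hanging lsof and "120 second wait" messages would both be consistent with tasks stuck in uninterruptible sleep, assuming those messages are the kernel's hung-task warnings.  Below is a minimal sketch for listing such tasks on a Linux box by reading /proc/PID/stat.  The function names parse_stat and blocked_processes are made up for illustration, and this has not been run on the affected servers.</div>

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/PID/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the file was not in the expected form
        if state == "D":
            found.append((pid, comm))
    return found
```

<div>Calling blocked_processes() a few times a second apart and seeing the same pids each time would suggest tasks that are genuinely stuck on IO rather than briefly waiting.</div>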


<br></div><div>By "refusing to start" I mean the init script never completed and exited.  It started the process, and that process never responded (I'm guessing), or the init script would have finished whatever it does.  I wish I still had the ps aufx output I took when I saw this - it looked fairly normal, with some things missing.  Here's what we have now:</div>

<div><div>root      2484  0.0  0.0 108488  1116 ?        S    Mar11   0:00 /bin/sh /etc/rc3.d/S80rabbitmq-server start</div><div>root      2490  0.0  0.0 108216  1444 ?        S    Mar11   0:00  \_ /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/sbin/rabbitmq-server</div>

<div>root      2493  0.0  0.0 108216  1488 ?        S    Mar11   0:00      \_ /bin/sh /usr/sbin/rabbitmq-server</div><div>root      2506  0.0  0.0 163228  2160 ?        S    Mar11   0:00          \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server</div>

<div>rabbitmq  2511  5.9  0.0 3881272 79464 ?       Ssl  Mar11  87:11              \_ /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -h</div><div>rabbitmq  2676  0.0  0.0  10844   540 ?        Ss   Mar11   0:00                  \_ inet_gethost 4</div>

<div>rabbitmq  2677  0.0  0.0  12948   712 ?        S    Mar11   0:00                      \_ inet_gethost 4</div></div><div><br></div><div>When I was having the issues, I saw duplicates of the beam.smp process running, and what appeared to be duplicate start commands as well, but no inet_gethost 4 running.  A netstat showed nothing on 5672 or 15672: not listening, not connected, and not in any other state.</div>

<div><br></div><div>By "telnetting to the port" I mean I literally did these:<br></div><div>telnet <server> 5672</div><div>telnet <server> 15672</div><div>I entered a bunch of garbage in both cases, and nothing happened.  At the very least on the management port, typing "GET /" should have given a response with a request for authorization or similar.  And yes, beam and the epmd process were running before I did this.</div>
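<div><br></div><div>The same check can be scripted so that "connected but silent" and "could not connect" are reported separately.  This is a minimal sketch in Python rather than the telnet session itself; the function name probe, the default payload and the timeout are illustrative choices, not anything taken from rabbitmq.</div>

```python
import socket


def probe(host, port, payload=b"GET / HTTP/1.0\r\n\r\n", timeout=5.0):
    """Connect to host:port, send payload, and describe what came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            try:
                data = sock.recv(256)
            except socket.timeout:
                # TCP connect worked, but the process never answered
                return "connected, no response"
            if data:
                return "response: %r" % data[:60]
            return "connected, closed by peer"
    except OSError as exc:
        # refused, unreachable, or timed out while connecting
        return "connect failed: %s" % exc
```

<div>For the two ports mentioned above this would be called as probe("server", 5672) and probe("server", 15672), with "server" standing in for the real hostname.  A healthy management port should answer the GET with some HTTP response; what the AMQP port does with that payload is not something this sketch assumes.</div>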


<div><br></div><div>Please note these servers are NOT clustered together - they're completely independent.  They all just failed around the same time (within about a four-hour window).  They had also all been started at about the same time back in August.</div>


<div><br></div><div>Next time I see this I'll try to get an strace on the beam process and see if I can get some more information.  As I said, there is definitely SOMETHING screwy here, and the way it behaved felt very much like an OS lock of some sort (a semaphore lock or something similar), and I'd guess rabbit (well, erlang) didn't handle it very well.  The only thing I found that was even remotely suspicious was the md check in the syslog:</div>


<div><br></div><div><div>Mar  9 03:22:02 mongobdc14p kernel: imklog 4.6.2, log source = /proc/kmsg started.</div><div>Mar  9 03:22:02 mongobdc14p rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="5815" x-info="<a href="http://www.rsyslog.com" target="_blank">http://www.rsyslog.com</a>"] (re)start</div>


<div>Mar  9 04:48:19 mongobdc14p kernel: md: md6: data-check done.</div><div>Mar  9 04:48:19 mongobdc14p kernel: md: data-check of RAID array md5</div><div>Mar  9 04:48:19 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.</div>


<div>Mar  9 04:48:19 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.</div><div>Mar  9 04:48:19 mongobdc14p kernel: md: using 128k window, over a total of 104856508 blocks.</div>


<div>Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md0 until md5 has finished (they share one or more physical units)</div><div>Mar  9 04:48:19 mongobdc14p kernel: md: delaying data-check of md1 until md5 has finished (they share one or more physical units)</div>


<div>Mar  9 05:03:45 mongobdc14p kernel: md: md5: data-check done.</div><div>Mar  9 05:03:45 mongobdc14p kernel: md: data-check of RAID array md0</div><div>Mar  9 05:03:45 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.</div>


<div>Mar  9 05:03:45 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.</div><div>Mar  9 05:03:45 mongobdc14p kernel: md: using 128k window, over a total of 524276 blocks.</div>


<div>Mar  9 05:03:45 mongobdc14p kernel: md: delaying data-check of md1 until md0 has finished (they share one or more physical units)</div><div>Mar  9 05:03:50 mongobdc14p kernel: md: md0: data-check done.</div><div>Mar  9 05:03:50 mongobdc14p kernel: md: data-check of RAID array md1</div>


<div>Mar  9 05:03:50 mongobdc14p kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.</div><div>Mar  9 05:03:50 mongobdc14p kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.</div>


<div>Mar  9 05:03:50 mongobdc14p kernel: md: using 128k window, over a total of 31456188 blocks.</div><div>Mar  9 05:08:53 mongobdc14p kernel: md: md1: data-check done.</div><div>Mar 10 14:40:01 mongobdc14p auditd[21812]: Audit daemon rotating log files</div>


<div>Mar 11 11:46:46 mongobdc14p init: Failed to spawn splash-manager main process: unable to open console: Input/output error</div></div><div><br></div><div>(11:46 was when I finally gave up trying to debug and just restarted the whole server, so you can ignore the IO error on the splash-manager.)  Cron jobs were all running fine, and there were no problems anyplace that I could actually see.  Before things stopped responding on the 11th, rabbit's file descriptor usage went up slightly, and then it stopped responding at about 10am.  Memory was a bit spiky but nothing out of the ordinary.  Free disk space hadn't changed to speak of (a few MB up and down).</div>
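<div><br></div><div>To see how long each md data-check in the excerpt above actually ran, the start and "done" lines can be paired up.  This is a rough sketch that assumes the syslog line format shown above; the names check_durations and parse_stamp are illustrative, and the year is assumed to be 2014 (from the date of this thread) because syslog timestamps omit it.</div>

```python
import re
from datetime import datetime

# Patterns follow the kernel md lines as they appear in the excerpt above.
START = re.compile(
    r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: md: data-check of RAID array (\w+)")
DONE = re.compile(
    r"^(\w{3}\s+\d+ [\d:]{8}) \S+ kernel: md: (\w+): data-check done\.")


def parse_stamp(stamp, year=2014):
    # syslog omits the year; 2014 is an assumption taken from this thread
    return datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")


def check_durations(lines):
    """Map each md device to the number of seconds its data-check took."""
    started, durations = {}, {}
    for line in lines:
        match = START.match(line)
        if match:
            started[match.group(2)] = parse_stamp(match.group(1))
            continue
        match = DONE.match(line)
        # a "done" line with no matching start in the excerpt is skipped
        if match and match.group(2) in started:
            begun = started.pop(match.group(2))
            delta = parse_stamp(match.group(1)) - begun
            durations[match.group(2)] = int(delta.total_seconds())
    return durations
```

<div>Run over the lines above, this gives 926 seconds for md5, 5 seconds for md0 and 303 seconds for md1; md6 is skipped because its start line is not in the excerpt.  All of these checks finished on Mar 9, two days before rabbit stopped responding.</div>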

<div><br></div><div><table class="" cellpadding="3" cellspacing="1" style="empty-cells:show;width:5474px;color:rgb(31,31,31);font-family:verdana,arial,helvetica,sans-serif;font-size:10px;line-height:18px"><tbody><tr class="" style="color:inherit;padding:0px">

<td class="" style="color:inherit;padding:1px 3px;white-space:pre;border:0px;height:22px;border-top-left-radius:2px;border-top-right-radius:2px;border-bottom-right-radius:2px;border-bottom-left-radius:2px;background-color:rgb(242,247,250)">

9.1 rabbitmq /usr/lib/erlang/erts-5.9.3.1/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@mongobdc14p -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/data/rabbitmq/rabbit@mongobdc14p.log"} -rabbit sasl_error_logger {file,"/data/rabbitmq/rabbit@mongobdc14p-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.0.4/sbin/../plugins" -rabbit plugins_expand_dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/data/rabbitmq/rabbitmq/mnesia/rabbit@mongobdc14p"</td>

</tr><tr class="" style="color:inherit;padding:0px"></tr></tbody></table><br></div><div>If you'd like, I've got more graphs and data I can share (I had the zabbix agent running the zabbix-rabbitmq monitoring stuff I wrote on that box), but I really haven't been able to find anything wrong.  The management API simply stopped responding at 10:32, context switches on the kernel suddenly dropped, and it's as if everything just stopped doing anything.</div>

<div><br></div><div>At this point I'd chalk this up to an "it's an anomaly that shouldn't require a whole lot of investigation" kind of situation, though, so don't worry about it too heavily.  Or blame Oracle - I like blaming Oracle :)</div>


<div>Jason</div><div><br></div><div class="gmail_extra">On Wed, Mar 12, 2014 at 11:45 AM, Tim Watson <span dir="ltr"><<a href="mailto:tim@rabbitmq.com" target="_blank">tim@rabbitmq.com</a>></span> wrote:<br><div class="gmail_quote">

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi Jason,<br>

<div class=""><br>

On 12 Mar 2014, at 14:32, Jason McIntosh wrote:<br>

<br>

> I never could get any node up - nothing showed up in the logs for startup, shutdown, regular and sasl logs.  One interesting thing is on the startup after killing all the processes (including epmd) it appears to have started multiple beams instead of the typical one.<br>


<br>

</div>Well, if there are rabbit (i.e., beam.smp) processes running, then you _do_ have a node up, though not necessarily responding properly. That's not the same as the program refusing to start though.<br>

<div class=""><br>

><br>

> By non-responsive, rabbitmqctl shows the node as being down,<br>

<br>

</div>Could this be a file system corruption issue? Have you checked all the usual suspects for when rabbitmqctl won't connect to a node that you know is running, like checking the erlang cookies match?<br>

<div class=""><br>

> but I could telnet to both the management port and the rabbitmq port (which I'm guessing is the epmd process)<br>

<br>

</div>I'm not really sure what you mean by "the rabbitmq port"; we could be talking about several things here: (a) the port on which the broker accepts AMQP connections, (b) the port on which the broker accepts distributed erlang connections (which the node registers with epmd), or (c) something else, so it's not clear to me what we're saying here.<br>


<div class=""><br>

> , but nothing shows in the log files for rabbit itself, nothing in the sasl logs, no content from the management port, etc.<br>

<br>

</div>Hmm, so you're sure (via ps and/or top) that there are beam.smp processes running, and you can see (via netstat) that the management HTTP port is in use, but there's no response from the HTTP (management) server?<br>


<div class=""><br>

>  The OS was completely responsive - I could get to the shell, do most commands, though lsof (I think I mentioned this) wouldn't respond in any timely manner while rabbit server was running.<br>

<br>

</div>That _is_ very strange.<br>

<div class=""><br>

>  All the ways I know of to talk to rabbit and all the OS things I know to do were failing, and everything I know to try in order to restart it (killing the processes, waiting for network connections in TIME_WAIT to drain, killing epmd as well) failed.<br>


><br>

> At that point, I recycled each of the servers and they're back to a running state.  I don't know that I have a box handy but I'll see what I can do to replicate this.  For some reason I think I've seen this before and it's something with the OEL 6.2 kernel that Oracle put together, dealing with disk io flush on a journaled file system.  I'm pretty sure I've seen the same thing about 6 months ago or so on a completely different set of servers.<br>


<br>

</div>Urgh, that sounds horrible. The more info you can provide us with the better. If you can replicate, that would be amazing since we can do the same thing and investigate.<br>

<div class=""><div class="h5"><br>

Cheers,<br>

Tim<br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">Jason McIntosh<br><a href="https://github.com/jasonmcintosh/" target="_blank">https://github.com/jasonmcintosh/</a><br>573-424-7612</div>


</div></div>