[rabbitmq-discuss] System CPU spikes, multi-second latency

Fri Oct 14 13:27:51 BST 2011

Hi Ben,

[ liberally snipped ]

Many thanks for the detailed report and your thorough investigation.

On Wed, Oct 12, 2011 at 10:41:25PM -0500, Ben Rady wrote:
> These episodes last from just a couple of seconds up to a minute, and seem
> to happen randomly, sometimes coming one right after another, sometimes
> hours apart. We've never seen it less than once a day.

Very odd - we've not seen anything like that before. Are there any
entries in the Rabbit logs during this period?

> ==What we've already done==
> 
> 1) Troubleshooting using strace
...
> The first system call logged for 21860 occurs at 21:13:53. However, notice
> the timestamp and 4.4 second execution time for the futex call in the other
> thread (21862). All of the straces for the threads were started at the same
> time (using the -ff option) and a number of other threads have a similar
> call around 21:13:48.6635 as their first logged system call (within a few
> milliseconds). This leads me to believe that whatever 21860 is doing during
> this time, it's working in a single system call rather
> than repeatedly calling in a loop.

This is really very odd indeed. When Rabbit is blocking on IO, I
normally expect to see top showing that the CPU is blocking waiting, not
in system. Other than reading and writing to files and sockets, Rabbit
doesn't really do a great deal more *explicitly* that will result in
system calls - thus it's very likely this is something odd happening
with the Erlang VM itself.

The fact that apparently the whole Erlang VM blocks during this
operation is deeply worrying.

> 2) Reproducing the problem in a controlled environment
> 
> This problem only seems to occur in our production environments, but not in
> any test environments. The software and hardware in the two environments are
> identical. We've used data recorded from our prod environment to try and
> reproduce the problem in test (with higher load, even), but no luck so far.

Just in case it's a hardware issue then, are you able to try swapping
the machines over?

I'd also suggest running for a few days without the web API just to see
if that makes any difference. I'd also suggest using your own
self-compiled Erlang just in case there's some issue with the
Erlang-from-RPM that you're using. Also, R14B04 has just come out, and
it would be good to verify the problem still occurs with that.

During these pauses, is the rest of the machine responsive? Can you
generally do things that access the disks of the machine?

One thing that I've seen several times over the last couple of years
(and it now fills me with dread whenever I see it) is when a hard disk
starts to fail, the machine starts exhibiting pauses of several seconds
whilst it resets the drive controller and starts again. That said
though, that a) tends to produce a very very high load average (over
100 very quickly), and b) top shows blocked on waiting, not blocked on
system.

If it is some futex though that's taking a long time to return (maybe
associated with some core part of the erlang scheduler), I wonder if
that points to some bad RAM somewhere. I take it there are no
corresponding entries in dmesg or such like?

> I've got more data on this if anyone wants to hear it. Thanks in advance for
> your help.

Definitely - as I say, we've not seen this before, and it smells to me
like a problem that may lie outside of Rabbit itself - especially given
the fact it seems reticent to occur in your testing env.

It's also been suggested that you might just have too much RAM, and too
many cores, and should send some to us instead. ;)

Best wishes,

Matthew