[rabbitmq-discuss] Occasional slow message on a local machine

Thu Sep 6 10:11:22 BST 2012

On 09/06/2012 04:17 AM, Tim Watson wrote:
> 
> On 5 Sep 2012, at 22:19, Brennan Sellner <bsellner at seegrid.com> wrote:
>> We've upgraded Rabbit to 2.8.6.  Erlang has not yet been upgraded, but will be in our next release cycle.  R12B-5.8 was the newest official build we could find that worked with Fedora 11; I've since been able to hand-build the latest Erlang release, but it's still working its way through our testing department.
>>
> 
> R12 is really old, and the general performance of the emulator has improved hugely since then.

Understood.  Is it plausible that the Erlang version could explain the
sort of intermittent long delays we're seeing?  The upgrade is in the
works, but as we're dealing with a fielded robotic system, there's quite
a bit of testing necessary before an underpinning change like that is
released.

>> With #1, we're seeing the following librabbitmq-c functions occasionally take up to 7.28 seconds to return (3-4 seconds is much more common):
>>  - amqp_queue_declare
>>  - amqp_queue_bind
>>  - amqp_basic_consume
>>
> 
> Can you ascertain how much of that time is spent in system calls waiting for the network?

Not at present: I haven't instrumented librabbitmq-c as yet.  However,
Rabbit and the client are running on the same machine, connecting over
the loopback (127.0.0.1) interface.  I would expect network delays to be
minimal, though I haven't proven that yet.
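One way to get a first measurement without modifying librabbitmq-c is to wrap each call site with a monotonic clock and log any call that exceeds a threshold. The following is a minimal sketch, assuming a POSIX system with clock_gettime(CLOCK_MONOTONIC). timed_call and fake_declare are hypothetical names; fake_declare only sleeps, standing in for a small shim around amqp_queue_declare(), amqp_queue_bind() or amqp_basic_consume().

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC readings. */
static double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0 +
           (double)(end->tv_nsec - start->tv_nsec) / 1.0e6;
}

/* Hypothetical wrapper: runs fn(arg), prints a line to stderr if the call
 * took longer than threshold_ms, and returns the elapsed time in ms. */
static double timed_call(const char *label, void (*fn)(void *), void *arg,
                         double threshold_ms)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = elapsed_ms(&start, &end);
    if (ms > threshold_ms)
        fprintf(stderr, "slow call: %s took %.1f ms\n", label, ms);
    return ms;
}

/* Placeholder for a shim around a real librabbitmq-c call (assumption):
 * it simply sleeps for the number of milliseconds pointed to by arg. */
static void fake_declare(void *arg)
{
    long ms = *(const long *)arg;
    struct timespec delay = { .tv_sec = ms / 1000,
                              .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&delay, NULL);
}
```

Per-call timings only show how long each call took, not where the time went. Running the client under `strace -f -T` additionally reports the time spent inside each individual system call, which would show whether the stalls are spent blocked in read()/recv() on the loopback socket.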

>> We've primarily observed this happening just after launching a new thread, which establishes a fresh connection to Rabbit successfully and quickly, but has issues when establishing its first subscription.  
> 
> You do know that librabbitmq-c is *not* thread safe, at least iirc!?

Yep.  Each thread opens its own TCP connection to Rabbit, and maintains
its own state.
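The one-connection-per-thread arrangement can be sketched with POSIX threads. In this minimal sketch, struct worker_conn is a hypothetical placeholder for the per-thread librabbitmq-c connection state; the only point it illustrates is that each thread is handed its own struct and no connection is ever touched by two threads.

```c
#include <pthread.h>

#define MAX_WORKERS 16

/* Hypothetical stand-in for one thread's connection state. */
struct worker_conn {
    int id;
    int channels_opened;
};

static void *worker(void *arg)
{
    struct worker_conn *conn = arg;   /* owned by this thread only */
    /* Real code would open the TCP connection, log in, open a channel and
     * declare/bind/consume here, all on this thread's own connection. */
    conn->channels_opened += 1;
    return NULL;
}

/* Starts n workers, each with a private worker_conn, and joins every thread
 * that was started. Returns 0 if all n started, -1 otherwise. */
static int run_workers(struct worker_conn *conns, int n)
{
    pthread_t threads[MAX_WORKERS];
    int started = 0;

    if (n < 0 || n > MAX_WORKERS)
        return -1;
    for (; started < n; started++) {
        conns[started].id = started;
        conns[started].channels_opened = 0;
        if (pthread_create(&threads[started], NULL, worker, &conns[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return started == n ? 0 : -1;
}
```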

>> My suspicion is that Rabbit is hitting disk for some reason; we've seen I/O-related delays with other processes (e.g. sqlite) with configuration #1 that we don't see with #2.  The SSD may be fast enough that we don't notice the issue on configuration #2.
>>
> 
> According to the top output (below), that's not happening, and Rabbit will not page to disk without hitting the high memory watermark if your messages are non-persistent.
> 
>> However, I don't see any reason that Rabbit would be hitting the disk.
> 
> Exactly.

Okay, thanks.  I wasn't sure if there were any other triggers that would
cause Rabbit to touch the disk, such as the creation of a queue.

> 
> I wonder if this delay is happening in the client rather than the server. Does a standalone sample client with no threading take so long to perform the basic operations?
> 

I haven't been able to reproduce this in a standalone example yet, but I
also haven't made the attempt on the same hardware that we're seeing the
problem on.  Doing so is next on my list.

>>> You might also be hitting flow control, but that is related to running out of
>>> RAM.  
> 
> Um, isn't flow control actually about applying back pressure to producers when the system is overloaded? That sounds feasible, and afaict it isn't just about hitting RAM limits.

How does Rabbit define 'overloaded'?  It's a dual-core system; one core
is pretty much pegged by one of our processes, but the other is under
only light load.  I've never seen Rabbit consume more than ~10% of it.

Thanks,

-Brennan