[rabbitmq-discuss] Occasional slow message on a local machine

Tim Watson watson.timothy at gmail.com
Thu Sep 6 09:17:02 BST 2012


On 5 Sep 2012, at 22:19, Brennan Sellner <bsellner at seegrid.com> wrote:

> My apologies for the long delay, but we have quite a bit more information available, after doing some extensive field testing.
> 
> We now return you to your thread already in progress...
> 
> 
> We've upgraded Rabbit to 2.8.6.  Erlang has not yet been upgraded, but will be in our next release cycle.  R12B-5.8 was the newest official build we could find that worked with Fedora 11; I've since been able to hand-build the latest Erlang release, but it's still working its way through our testing department.
> 

R12 is really old, and the general performance of the emulator has improved hugely since then.

> We now have two different types of systems fielded, one where we're having issues with slow Rabbit operations, and one where we're not (or, at least, any slow operations aren't slow enough to trigger our watchdogs).  The software configuration is identical on both.
> 
> #1: Problematic:
>  - CPU: Core 2 Duo E7300 @ 2.66 GHz
>  - RAM: 2 GB
>  - Disk: SATA 3.0 Gb/s, 5400 RPM
> 
> #2: Working fine:
>  - CPU: i5 M520 @ 2.4 GHz (2 cores + hyperthreading)
>  - RAM: 4 GB
>  - Disk: SATA 6.0 Gb/s, SSD
> 
> With #1, we're seeing the following librabbitmq-c functions occasionally take up to 7.28 seconds to return (3-4 seconds is much more common):
>  - amqp_queue_declare
>  - amqp_queue_bind
>  - amqp_basic_consume
> 

Can you ascertain how much of that time is spent in system calls waiting for the network?
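
If strace is available on your Fedora 11 image, attaching it to the client process during one of the stalls should show where the time goes; something along the lines of:

    strace -f -T -e trace=network -p <client pid>

-T prints the time spent inside each syscall, so a recv or poll that blocks for several seconds will stand out immediately.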

> We don't have any persistent messages.  Queues are being declared with server-assigned names, passive is false, durable is false, exclusive is true, and auto-delete is true.  The consume calls use an empty consumer tag, set no-local to false, no-ack to false, and exclusive to false.
> 
> We've primarily observed this happening just after launching a new thread, which establishes a fresh connection to Rabbit successfully and quickly, but has issues when establishing its first subscription.  

You do know that librabbitmq-c is *not* thread-safe, at least IIRC?
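
If any librabbitmq-c state *is* shared between threads, all bets are off. The safe pattern is to give each thread its own connection and channel and never touch them from anywhere else. A rough sketch (host, port and credentials are placeholders):

    #include <amqp.h>
    #include <amqp_framing.h>

    /* One connection per thread; no librabbitmq-c state is shared. */
    static void *worker(void *arg)
    {
        (void)arg;
        amqp_connection_state_t conn = amqp_new_connection();
        int fd = amqp_open_socket("localhost", 5672);  /* placeholder host/port */
        amqp_set_sockfd(conn, fd);
        amqp_login(conn, "/", 0, 131072, 0, AMQP_SASL_METHOD_PLAIN,
                   "guest", "guest");                  /* placeholder credentials */
        amqp_channel_open(conn, 1);
        /* ... declare/bind/consume using only this thread's conn ... */
        return NULL;
    }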

> This may be sampling bias, though, as our watchdogs are by chance mostly concentrated around new threads being launched.  The full sequence of librabbitmq-c calls in this case is:
> 
>  - amqp_new_connection
>  - amqp_open_socket
>  - amqp_set_sockfd
>  - amqp_login
>  - amqp_channel_open
>  - amqp_basic_qos
>  - amqp_channel_open  # For the new subscription
>  - amqp_basic_qos
>  - amqp_exchange_declare
>  - amqp_queue_declare # Sometimes slow
>  - amqp_queue_bind    # Sometimes slow
>  - amqp_basic_consume # Sometimes slow
> 
> My suspicion is that Rabbit is hitting disk for some reason; we've seen I/O-related delays with other processes (e.g. sqlite) with configuration #1 that we don't see with #2.  The SSD may be enough faster that we don't notice the issue on configuration #2.
> 

According to the top output (below) that's not happening, and Rabbit will not page to disk without hitting the high memory watermark if your messages are non-persistent.
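
For what it's worth, the figures in your status dump below bear that out: the watermark works out to 0.4 * 2016560 KB = 806624 KB, i.e. the 825982976 bytes reported as vm_memory_limit, while the total memory Rabbit reports using is only ~22.6 MB. You are nowhere near it.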

> However, I don't see any reason that Rabbit would be hitting the disk.

Exactly.

> Once the problematic librabbitmq-c function times out, our code dumps `rabbitmqctl status` to disk.  To take a concrete example, amqp_queue_bind took 3.72 seconds to return.  When the dump ran two seconds later, it reported the below:
> 
> Status of node 'AISINR4@AISINR4' ...
> [{pid,1698},
> {running_applications,[{rabbitmq_shovel,"Data Shovel for RabbitMQ","2.8.6"},
>                        {erlando,"Syntax extensions for Erlang","2.8.6"},
>                        {amqp_client,"RabbitMQ AMQP Client","2.8.6"},
>                        {rabbit,"RabbitMQ","2.8.6"},
>                        {os_mon,"CPO  CXC 138 46","2.1.8"},
>                        {sasl,"SASL  CXC 138 11","2.1.5.4"},
>                        {mnesia,"MNESIA  CXC 138 12","4.4.7"},
>                        {stdlib,"ERTS  CXC 138 10","1.15.5"},
>                        {kernel,"ERTS  CXC 138 10","2.12.5"}]},
> {os,{unix,linux}},
> {erlang_version,"Erlang (BEAM) emulator version 5.6.5 [source] [smp:2] [async-threads:30] [hipe] [kernel-poll:true]\n"},
> {memory,[{total,22599840},
>          {processes,9909296},
>          {processes_used,9706816},
>          {system,12690544},
>          {atom,782145},
>          {atom_used,751012},
>          {binary,650640},
>          {code,6257852},
>          {ets,4197052}]},
> {vm_memory_high_watermark,0.4},
> {vm_memory_limit,825982976},
> {disk_free_limit,1000000000},
> {disk_free,140100812800},
> {file_descriptors,[{total_limit,924},
>                    {total_used,17},
>                    {sockets_limit,829},
>                    {sockets_used,13}]},

Looking at the fd counts (only 17 of 924 handles in use, 13 of them sockets), I doubt the message store is writing to disk, although I could be wrong.

> {processes,[{limit,1048576},{used,304}]},
> {run_queue,0},
> {uptime,2106}]
> ...done.
> 
> It doesn't look like we're anywhere near the memory high watermark, nor are we stressing the system in terms of sockets or file handles.  We dumped the output of top at the same time, which indicates that we're not touching swap:
> 
> top - 05:18:56 up 35 min,  0 users,  load average: 3.85, 2.26, 1.79
> Tasks:  97 total,   2 running,  95 sleeping,   0 stopped,   0 zombie
> Cpu(s): 16.1%us, 10.2%sy,  0.0%ni, 69.2%id,  3.6%wa,  0.6%hi,  0.3%si, 0.0%st
> Mem:   2016560k total,  1658564k used,   357996k free,   157468k buffers
> Swap:  1023992k total,        0k used,  1023992k free,   787772k cached
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
> 1698 rabbitmq  20   0 79096  26m 2136 S  7.9  1.4   2:16.89 beam.smp 
> 
> There was nothing interesting in either of Rabbit's logs.
> 

I wonder if this delay is happening in the client rather than the server. Does a standalone sample client with no threading take that long to perform the basic operations?
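
Something along these lines would isolate the client side: a minimal single-threaded program that times the three calls you see stalling. It's only a sketch (host, exchange and credentials are placeholders, error handling is omitted); link with -lrabbitmq and -lrt:

    #include <stdio.h>
    #include <time.h>
    #include <amqp.h>
    #include <amqp_framing.h>

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        amqp_connection_state_t conn = amqp_new_connection();
        int fd = amqp_open_socket("localhost", 5672);  /* placeholder host/port */
        amqp_set_sockfd(conn, fd);
        amqp_login(conn, "/", 0, 131072, 0, AMQP_SASL_METHOD_PLAIN,
                   "guest", "guest");                  /* placeholder credentials */
        amqp_channel_open(conn, 1);

        /* Same flags as your production code: server-named, non-durable,
         * exclusive, auto-delete queue. */
        double t = now_s();
        amqp_queue_declare_ok_t *r =
            amqp_queue_declare(conn, 1, amqp_empty_bytes, 0, 0, 1, 1,
                               amqp_empty_table);
        printf("declare: %.3fs\n", now_s() - t);

        /* Copy the server-assigned name; the decode buffer is recycled
         * on the next operation. */
        amqp_bytes_t q = amqp_bytes_malloc_dup(r->queue);

        t = now_s();
        amqp_queue_bind(conn, 1, q, amqp_cstring_bytes("amq.direct"),
                        amqp_cstring_bytes("test"), amqp_empty_table);
        printf("bind:    %.3fs\n", now_s() - t);

        t = now_s();
        amqp_basic_consume(conn, 1, q, amqp_empty_bytes, 0, 0, 0,
                           amqp_empty_table);
        printf("consume: %.3fs\n", now_s() - t);
        return 0;
    }

If even this shows multi-second pauses on configuration #1, the problem is on the broker or network side rather than in your threading.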

>> You might also be hitting flow control, but that is related to running out of
>> RAM.  

Um, isn't flow control actually about applying back-pressure to producers when the system is overloaded? That sounds feasible, and AFAICT it isn't just about hitting RAM limits.

>> You can check that as well via the management plugin, the connections will
>> be marked as "blocked".
> 
> `rabbitmqctl list_connections pid name last_blocked_by last_blocked_age` lists the last-blocked-age as infinity for all connections.  The management plugin is unfortunately not supported with the version of Erlang that we're using at the moment.
> 
> Anyone have any ideas about what might cause these random slow responses, or other data we can collect to help debug?  I'm rather baffled at the moment.
> 
> Thanks,
> 
> -Brennan

