[rabbitmq-discuss] Occasional slow message on a local machine
Brennan Sellner
bsellner at seegrid.com
Wed Sep 5 22:19:09 BST 2012
My apologies for the long delay, but we have quite a bit more
information available after doing some extensive field testing.
We now return you to your thread, already in progress...
On 07/24/2012 05:38 AM, Francesco Mazzoli wrote:
> Hi Brennan,
> At Mon, 23 Jul 2012 15:57:24 -0400,
> Brennan Sellner wrote:
>> The rabbitmq server (v2.8.2) is running on the same machine, so there aren't
>> any external networks involved. We're still running Fedora 11 (kernel 2.6.29,
>> Erlang R12B-5.8) for the moment, if that's relevant.
>
> Please upgrade both RabbitMQ and Erlang if you can (especially rabbit), or
> things will be hard to diagnose for us.
We've upgraded Rabbit to 2.8.6. Erlang has not yet been upgraded, but
will be in our next release cycle. R12B-5.8 was the newest official
build we could find that worked with Fedora 11; I've since been able to
hand-build the latest Erlang release, but it's still working its way
through our testing department.
We now have two different types of systems fielded: one where we're
having issues with slow Rabbit operations, and one where we're not (or,
at least, any slow operations aren't slow enough to trigger our
watchdogs). The software configuration is identical on both.
#1: Problematic:
- CPU: Core 2 Duo E7300 @ 2.66 GHz
- RAM: 2 GB
- Disk: SATA 3.0 Gb/s, 5400 RPM
#2: Working fine:
- CPU: i5 M520 @ 2.4 GHz (2 cores + hyperthreading)
- RAM: 4 GB
- Disk: SATA 6.0 Gb/s, SSD
With #1, we're seeing the following librabbitmq-c functions occasionally
take up to 7.28 seconds to return (3-4 seconds is much more common):
- amqp_queue_declare
- amqp_queue_bind
- amqp_basic_consume
We don't have any persistent messages. Queues are being declared with
server-assigned names, passive is false, durable is false, exclusive is
true, and auto-delete is true. The consume calls use an empty consumer
tag, set no-local to false, no-ack to false, and exclusive to false.
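In code, those declare/bind/consume parameters look roughly like the
following. This is only a sketch, not our actual source: "conn" is
assumed to be an open, logged-in connection with channel 2 already
open, the exchange and routing-key names are placeholders, and error
handling is omitted.

#include <amqp.h>
#include <amqp_framing.h>

/* Sketch only: conn is an open, logged-in connection; channel 2 is
 * already open; exchange/routing-key names are made up. */
static void subscribe_sketch(amqp_connection_state_t conn)
{
    amqp_queue_declare_ok_t *q =
        amqp_queue_declare(conn, 2, amqp_empty_bytes,
                           0,   /* passive = false */
                           0,   /* durable = false */
                           1,   /* exclusive = true */
                           1,   /* auto-delete = true */
                           amqp_empty_table);
    amqp_bytes_t queue = amqp_bytes_malloc_dup(q->queue);

    amqp_queue_bind(conn, 2, queue,
                    amqp_cstring_bytes("some.exchange"), /* placeholder */
                    amqp_cstring_bytes("some.key"),      /* placeholder */
                    amqp_empty_table);

    amqp_basic_consume(conn, 2, queue,
                       amqp_empty_bytes, /* server-assigned consumer tag */
                       0,                /* no-local = false */
                       0,                /* no-ack = false */
                       0,                /* exclusive = false */
                       amqp_empty_table);
}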
We've primarily observed this happening just after launching a new
thread, which establishes a fresh connection to Rabbit successfully and
quickly, but has issues when establishing its first subscription. This
may be sampling bias, though, as our watchdogs are by chance mostly
concentrated around new threads being launched. The full sequence of
librabbitmq-c calls in this case (sketched in code after the list) is:
- amqp_new_connection
- amqp_open_socket
- amqp_set_sockfd
- amqp_login
- amqp_channel_open
- amqp_basic_qos
- amqp_channel_open # For the new subscription
- amqp_basic_qos
- amqp_exchange_declare
- amqp_queue_declare # Sometimes slow
- amqp_queue_bind # Sometimes slow
- amqp_basic_consume # Sometimes slow
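For completeness, the setup sequence itself looks roughly like this.
Again a sketch rather than our production code: the host, port,
credentials, frame-max, and prefetch count below are all stand-ins,
and return codes go unchecked for brevity.

/* Sketch of the per-thread setup; values are illustrative only. */
static amqp_connection_state_t connect_sketch(void)
{
    amqp_connection_state_t conn = amqp_new_connection();
    int sockfd = amqp_open_socket("localhost", 5672);   /* stand-in host/port */
    amqp_set_sockfd(conn, sockfd);
    amqp_login(conn, "/", 0, 131072, 0,
               AMQP_SASL_METHOD_PLAIN, "guest", "guest"); /* stand-in creds */
    amqp_channel_open(conn, 1);
    amqp_basic_qos(conn, 1, 0, 1, 0);   /* illustrative prefetch of 1 */
    amqp_channel_open(conn, 2);         /* channel for the new subscription */
    amqp_basic_qos(conn, 2, 0, 1, 0);
    /* ...then the exchange declare and the queue declare/bind/consume
     * calls sketched earlier, which are the ones that sometimes stall. */
    return conn;
}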
My suspicion is that Rabbit is hitting disk for some reason; with
configuration #1 we've seen I/O-related delays in other processes (e.g.
sqlite) that we don't see with #2. The SSD in #2 may simply be fast
enough to mask the issue.
However, I don't see any reason that Rabbit would be hitting the disk.
Once the problematic librabbitmq-c function times out, our code dumps
the output of `rabbitmqctl status` to disk. To take a concrete example,
amqp_queue_bind took 3.72 seconds to return. When the dump ran two
seconds later, it reported the following:
Status of node 'AISINR4@AISINR4' ...
[{pid,1698},
{running_applications,[{rabbitmq_shovel,"Data Shovel for
RabbitMQ","2.8.6"},
{erlando,"Syntax extensions for Erlang","2.8.6"},
{amqp_client,"RabbitMQ AMQP Client","2.8.6"},
{rabbit,"RabbitMQ","2.8.6"},
{os_mon,"CPO CXC 138 46","2.1.8"},
{sasl,"SASL CXC 138 11","2.1.5.4"},
{mnesia,"MNESIA CXC 138 12","4.4.7"},
{stdlib,"ERTS CXC 138 10","1.15.5"},
{kernel,"ERTS CXC 138 10","2.12.5"}]},
{os,{unix,linux}},
{erlang_version,"Erlang (BEAM) emulator version 5.6.5 [source] [smp:2]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,[{total,22599840},
{processes,9909296},
{processes_used,9706816},
{system,12690544},
{atom,782145},
{atom_used,751012},
{binary,650640},
{code,6257852},
{ets,4197052}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,825982976},
{disk_free_limit,1000000000},
{disk_free,140100812800},
{file_descriptors,[{total_limit,924},
{total_used,17},
{sockets_limit,829},
{sockets_used,13}]},
{processes,[{limit,1048576},{used,304}]},
{run_queue,0},
{uptime,2106}]
...done.
It doesn't look like we're anywhere near the memory high watermark, nor
are we stressing the system in terms of sockets or file handles. We
dumped the output of top at the same time, which indicates that we're
not touching swap:
top - 05:18:56 up 35 min, 0 users, load average: 3.85, 2.26, 1.79
Tasks: 97 total, 2 running, 95 sleeping, 0 stopped, 0 zombie
Cpu(s): 16.1%us, 10.2%sy, 0.0%ni, 69.2%id, 3.6%wa, 0.6%hi, 0.3%si,
0.0%st
Mem: 2016560k total, 1658564k used, 357996k free, 157468k buffers
Swap: 1023992k total, 0k used, 1023992k free, 787772k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1698 rabbitmq 20 0 79096 26m 2136 S 7.9 1.4 2:16.89 beam.smp
There was nothing interesting in either of Rabbit's logs.
> You might also be hitting flow control, but that is related to running out of
> RAM. You can check that as well via the management plugin, the connections will
> be marked as "blocked".
`rabbitmqctl list_connections pid name last_blocked_by last_blocked_age`
lists the last-blocked-age as infinity for all connections. The
management plugin is unfortunately not supported with the version of
Erlang that we're using at the moment.
Anyone have any ideas about what might cause these random slow
responses, or other data we can collect to help debug? I'm rather
baffled at the moment.
Thanks,
-Brennan