[rabbitmq-discuss] Feature Req / Bug list

Graeme N graeme at sudo.ca
Thu Oct 24 19:30:31 BST 2013


Hi everyone,

Took longer than I anticipated, but here's my first pass at reproducible
test cases for some of the issues I reported. Attached to this e-mail is a
tarball that I've been unpacking into /etc/rabbitmq/cluster on a 4CPU/4GB
RAM VM running CentOS 6.4, rabbitmq-server-3.2.0-1.noarch.rpm, and
librabbitmq-tools-0.3.0-1.el6.x86_64. These are the same issues we saw on
our production hardware, so they seem to be reproducible.

- cd /etc/rabbitmq/cluster
- create a 5-node local cluster: ./create_cluster.sh (sketched after this
list)
- Bug: even though per-node AMQP and management listener ports are
specified, the first instance started still incorrectly binds port 55672
for the management interface.
- remove any existing queues, then create 100 queues in parallel:
./setup_queues.sh (sketched after this list)
- Bugs: many of the operations fail outright with a variety of error
messages instead of just blocking while the cluster is busy, and the
rabbit commands often hang and never return. Run this script in a "while
true" bash loop to watch it fall apart pretty quickly.
- populate queues with 1000 messages each in parallel:
./populate_queues.sh (sketched after this list)
- Note: shows the low delivery rates noted before on spinning disks (60-80
msgs/sec), even though my VM storage is on btrfs RAID10 capable of
sustained block writes > 200 MB/s. iostat shows the VM is only generating
1-8 MB/s of IO. Looking at messages under
/var/lib/rabbitmq/mnesia/rabbit2@localhost/queues, they seem to be chunked
into 64, 68, 72, and 84 KiB files before being delivered to the 16 MiB
msg_store_persistent/*.rdq files. This implies a lot of random IO while
delivering messages, which would explain why the performance problems
disappear when switching to SSDs, even just two SSDs in RAID1. With other
data stores we'd typically expect on-disk chunks in multiples of 128 MiB,
for both incoming and finalized data, to properly leverage RAID block IO.
The net effect is that it takes ~20 minutes to load ~32 MiB of messages,
which is pretty awful.
- set policies to evenly balance queues across cluster nodes:
./rebalance_queues.sh (sketched after this list)
- Bugs: also demonstrates API failures under too many simultaneous
requests; the script sometimes has to be Ctrl-C'd and re-run. After
running it multiple times, the admin interface shows ~20% of queues never
fully sync to all 3 nodes specified by the new "nodes" policy; they need
additional sync commands even though ha-sync-mode is set to automatic.
After 2-3 runs, some queues get stuck at "0% resyncing", and the entire
API stops responding completely until the cluster is killed and restarted.
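
For anyone who wants to try this without unpacking the tarball, here's a
minimal sketch of roughly what create_cluster.sh does, following the
standard single-machine clustering recipe from the RabbitMQ docs. The node
names and port offsets are my choices for illustration, not necessarily
what the attached script uses:

  #!/bin/bash
  # Sketch: start five nodes on one host, each with its own AMQP port and
  # management listener port, then join nodes 2-5 to node 1.
  # (Node names and port numbering are illustrative assumptions.)
  set -e
  for i in 1 2 3 4 5; do
      RABBITMQ_NODENAME="rabbit${i}" \
      RABBITMQ_NODE_PORT=$((5672 + i)) \
      RABBITMQ_SERVER_START_ARGS="-rabbitmq_management listener [{port,$((15672 + i))}]" \
      rabbitmq-server -detached
  done
  sleep 5  # give the nodes time to boot before clustering them
  for i in 2 3 4 5; do
      rabbitmqctl -n "rabbit${i}@$(hostname -s)" stop_app
      rabbitmqctl -n "rabbit${i}@$(hostname -s)" join_cluster "rabbit1@$(hostname -s)"
      rabbitmqctl -n "rabbit${i}@$(hostname -s)" start_app
  done

Even with an explicit management listener port per node like this, the
first node started still binds 55672.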
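
The queue setup step is essentially a parallel delete-and-declare loop. A
rough equivalent using rabbitmqadmin against node 1's management port
(queue names and ports are assumptions matching the sketch above; the real
script is in the tarball):

  #!/bin/bash
  # Sketch: delete and redeclare 100 durable queues in parallel.
  # Assumes default guest/guest credentials on localhost.
  for i in $(seq 1 100); do
      (
          rabbitmqadmin -P 15673 delete queue name="test_queue_${i}" 2>/dev/null
          rabbitmqadmin -P 15673 declare queue name="test_queue_${i}" durable=true
      ) &
  done
  wait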
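
Populating the queues has the same shape, using amqp-publish from
librabbitmq-tools through the default exchange (so the routing key is the
queue name). The flag names here are from my build, so double-check yours:
-r routing key, -p persistent, -b body:

  #!/bin/bash
  # Sketch: publish 1000 small persistent messages to each of the 100
  # queues, one background publisher per queue, via node 1's AMQP port.
  # (Queue names and ports match the earlier sketches.)
  for i in $(seq 1 100); do
      (
          for m in $(seq 1 1000); do
              amqp-publish --server=localhost --port=5673 \
                  -r "test_queue_${i}" -p -b "message ${m}"
          done
      ) &
  done
  wait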
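
The rebalance script just applies a "nodes" HA policy per queue, spreading
each queue's three nodes round-robin across the five cluster nodes. The
set_policy syntax is standard for 3.2; the node and queue names match the
sketches above:

  #!/bin/bash
  # Sketch: pin each queue to an explicit set of three nodes via a
  # "nodes" policy, applied in parallel to stress the API.
  HOST=$(hostname -s)
  NODES=(rabbit1 rabbit2 rabbit3 rabbit4 rabbit5)
  for i in $(seq 1 100); do
      n1=${NODES[$(( i % 5 ))]}
      n2=${NODES[$(( (i + 1) % 5 ))]}
      n3=${NODES[$(( (i + 2) % 5 ))]}
      rabbitmqctl -n "rabbit1@${HOST}" set_policy \
          "ha-test_queue_${i}" "^test_queue_${i}\$" \
          "{\"ha-mode\":\"nodes\",\"ha-params\":[\"${n1}@${HOST}\",\"${n2}@${HOST}\",\"${n3}@${HOST}\"],\"ha-sync-mode\":\"automatic\"}" &
  done
  wait

Queues that don't sync on their own can be kicked manually with
"rabbitmqctl -n rabbit1@$HOST sync_queue test_queue_N"; those are the
additional sync commands the ~20% of stuck queues end up needing.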

Hopefully you guys can replicate these results, since they are 100%
reproducible here. Any questions / comments, fire away.

Graeme


On Fri, Oct 4, 2013 at 11:32 AM, Graeme N <graeme at sudo.ca> wrote:

>
> On Fri, Oct 4, 2013 at 1:54 AM, Tim Watson <watson.timothy at gmail.com> wrote:
>
>> > All items below were discovered while deploying 3.1.5 over the past few
>> days. Hosts in question have 24 sandy bridge HT cores, 64GB of RAM, XFS
>> filesystem, running on CentOS 6. Cluster is 5 nodes, with a default HA
>> policy on all queues of exact/3/automatic-sync.
>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: break-rabbitmq-v1.tar.xz
Type: application/x-xz
Size: 1312 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20131024/f1bf0302/attachment.bin>

