[rabbitmq-discuss] Feature Req / Bug list

Fri Oct 4 09:54:20 BST 2013

On 3 Oct 2013, at 22:03, Graeme N <graeme at sudo.ca> wrote:
> 
> All items below were discovered while deploying 3.1.5 over the past few days. Hosts in question have 24 sandy bridge HT cores, 64GB of RAM, XFS filesystem, running on CentOS 6. Cluster is 5 nodes, with a default HA policy on all queues of exact/3/automatic-sync.
> 

That's a very strong consistency and redundancy guarantee for every queue. Do you really need such strong guarantees for all of them? HA comes at a cost.
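For reference, the policy described above (exactly 3 copies, automatic sync) could be written down roughly as follows. This is only a sketch: the policy name `ha-three` and the match-everything pattern `.*` are assumptions for illustration, not taken from the setup being discussed.

```python
import json


def ha_policy_definition(copies, auto_sync=True):
    """Build a mirrored-queue policy definition using ha-mode 'exactly'."""
    definition = {"ha-mode": "exactly", "ha-params": copies}
    if auto_sync:
        definition["ha-sync-mode"] = "automatic"
    return definition


def set_policy_command(name, pattern, definition):
    """Return the rabbitmqctl argument list that would apply the policy."""
    return ["rabbitmqctl", "set_policy", name, pattern,
            json.dumps(definition, sort_keys=True)]


# Hypothetical policy name and pattern; adjust to the real deployment.
cmd = set_policy_command("ha-three", ".*", ha_policy_definition(3))
print(" ".join(cmd))
```

With 5 nodes and 3 copies per queue, every publish to every queue is replicated to two other nodes, which is where the cost comes from.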

> HA / Clustering:
> 
> - expected queues to be distributed evenly among cluster machines, instead got all queues on first 3 machines in the cluster, nothing on the last 2.

Distributed evenly in what regard? Randomly, or based on some metric?
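To make the distinction concrete, here is a toy sketch of the two options. Neither function describes what RabbitMQ actually does; the node and queue names are hypothetical, and "load" here is simply the number of queue masters already placed on a node.

```python
import random
from collections import Counter


def place_randomly(queues, nodes, rng):
    """Pick a node for each queue uniformly at random."""
    return {q: rng.choice(nodes) for q in queues}


def place_on_least_loaded(queues, nodes):
    """Pick the node currently hosting the fewest queues (a simple metric)."""
    load = Counter({n: 0 for n in nodes})
    placement = {}
    for q in queues:
        target = min(nodes, key=lambda n: load[n])
        placement[q] = target
        load[target] += 1
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
queues = ["q%d" % i for i in range(10)]

print(Counter(place_randomly(queues, nodes, random.Random(0)).values()))
print(Counter(place_on_least_loaded(queues, nodes).values()))
```

Random placement only evens out over many queues; a metric-based choice is even by construction, but only with respect to the metric chosen.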

> - expected message reads from a mirror machine for a queue to do the read i/o locally, so as to spread out workload, but it appears to always go to the host where the queue was created.

That's expected behaviour. In a master-slave configuration, writes have to go to the master. Odd though it may sound, reads from a queue involve writes, since we have to do accounting (e.g. of pending ACKs, position in the queue, etc.), so all requests are handled by the master.
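A toy model may make this clearer. It is not RabbitMQ's implementation, just a minimal sketch showing that delivering a message changes the queue's state, which is why a mirror cannot serve the read on its own.

```python
from collections import deque


class ToyQueue:
    """Illustrative queue: every delivery mutates state that must stay consistent."""

    def __init__(self, messages):
        self.ready = deque(messages)   # messages not yet delivered
        self.pending_acks = {}         # delivery tag -> message awaiting ACK
        self.next_tag = 1

    def deliver(self):
        """'Read' one message: moves it from ready to pending_acks."""
        if not self.ready:
            return None
        msg = self.ready.popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.pending_acks[tag] = msg
        return tag, msg

    def ack(self, tag):
        """Consumer confirms the message; only now is it really gone."""
        self.pending_acks.pop(tag)


q = ToyQueue(["a", "b"])
tag, msg = q.deliver()
print(msg, list(q.ready), q.pending_acks)
q.ack(tag)
```

If two nodes each ran `deliver()` independently against their own copy, both would hand out "a"; routing everything through the master avoids that.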

> - this led to a single node with ~35k active open filehandles, and 4 nodes with ~90. not an optimum distribution of read workload.

Agreed. Simon or Matthias may be able to elaborate on various things we're working on to improve workload distribution.

> - expected that if the system a queue was created on is permanently removed (shut down and "rabbitmqctl forget_cluster_node hostname"'d), automatic sync would ensure there's the right number of copies replicated, but instead it just left every single queue under-replicated.

That doesn't sound right. It's not automatic sync we're talking about here either - that sounds like the policy isn't getting applied properly.

> - when a new policy is applied that defines specific replication nodes, or a number of copies using 'exact', and auto-sync is set, sometimes it just syncs the first replica and leaves any others unsynced and calls it job done. This is bad.

Can you provide us with a way to reproduce this? How did you detect that the remaining replicas were not sync'ed?
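One possible way to check, sketched below, is to compare each queue's list of mirrors against its list of synchronised mirrors. The sketch assumes queue records shaped like the management API's queue listing, with `slave_nodes` and `synchronised_slave_nodes` fields; treat those field names as assumptions to verify against your version. The sample data is made up.

```python
def unsynchronised_mirrors(queue):
    """Return mirrors of a queue that are not reported as synchronised."""
    mirrors = set(queue.get("slave_nodes", []))
    synced = set(queue.get("synchronised_slave_nodes", []))
    return sorted(mirrors - synced)


def report(queues):
    """Map queue name -> unsynchronised mirrors, skipping fully synced queues."""
    result = {}
    for q in queues:
        missing = unsynchronised_mirrors(q)
        if missing:
            result[q["name"]] = missing
    return result


# Hypothetical sample: one healthy queue, one with a lagging mirror.
sample = [
    {"name": "orders", "slave_nodes": ["rabbit@b", "rabbit@c"],
     "synchronised_slave_nodes": ["rabbit@b", "rabbit@c"]},
    {"name": "events", "slave_nodes": ["rabbit@b", "rabbit@c"],
     "synchronised_slave_nodes": ["rabbit@b"]},
]
print(report(sample))
```

Output along these lines, captured before and after the policy change, would help a lot in reproducing the problem.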

> - Attempted to create small per-queue policies to redistribute messages and then delete the per-queue policies, but this often leads to an inconsistent cluster state where queues continued to show as being part of a policy that was already deleted, attempt to resync, and get stuck, unable to complete or switch back to the global default policy.

Again, it would be helpful if you could help us to replicate this. 

> - sometimes the cluster refuses to accept any more policy commands. Have to fully shut down and restart the cluster to clear this condition.

And this. Can you provide a run down of these policies and the order in which you're trying to apply them? Also, how busy are the queues whilst the policy changes are happening? We may need to extend our test beds to reliably reproduce such problems.

> - sometimes policies applied to empty and inactive queues don't get correctly applied, and the queue hangs on "resyncing / 100%".

What!?

> this makes no sense, given the queue is empty, and requires a full cluster restart to clear.

Please provide the commands you invoked to get this to happen.

> - I've managed to get the cluster into an inconsistent state a /lot/ using the HA features, so it feels like they need more automated stress testing and bulletproofing.

If you can help us reproduce these errors, I can assure you that they'll get included in our integration tests!

> 
> Persistent message storage:
> 
> - it appears as if messages are put into very small batch files on the filesystem (1-20 MB)
> - this causes the filesystem to thrash if your IO isn't good at random IO (SATA disks) and you have lots of persistent messages (>200k messages 500kB-1MB in size) that don't fit in RAM.
> - this caused CentOS 6 kernel to kill erlang after stalling the XFS filesystem for > 120s.

IIRC this is tuneable, though we don't recommend changing it. I'm not at my desk right now though, so I can't remember the exact details.
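To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. The inputs are the lower bounds quoted in the report (200k messages, 500kB to 1MB each, files of at most 20 MB, 64GB of RAM); decimal units are an assumption, and the real on-disk layout will differ.

```python
KB, MB, GB = 10**3, 10**6, 10**9


def backlog_estimate(messages, msg_bytes, file_bytes, ram_bytes):
    """Estimate total backlog size, store file count, and whether it fits in RAM."""
    total = messages * msg_bytes
    files = -(-total // file_bytes)  # ceiling division
    return {
        "total_gb": total / GB,
        "files": files,
        "fits_in_ram": total <= ram_bytes,
    }


low = backlog_estimate(200_000, 500 * KB, 20 * MB, 64 * GB)
high = backlog_estimate(200_000, 1 * MB, 20 * MB, 64 * GB)
print(low)
print(high)
```

Even at the low end the backlog is well beyond what fits in memory, so it has to be read back from several thousand files on disk.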

> - if a node crashes, Rabbit seems to rescan the entire on-disk datastore before continuing, instead of using some sort of checkpointing or journaling system to quickly recover from a crash.
> - all of above should be solvable by using an existing append-only datastore like eLevelDB or Bitcask.

On our todo list already, at least for the message store index.

> Hopefully you guys can educate me on what I'm doing wrong in some of these scenarios, or how to mitigate some of these issues. Any issue that requires taking down and restarting the cluster to fix is especially troubling.
> 
> Thanks,
> Graeme
> 
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss