[rabbitmq-discuss] Feature Req / Bug list

Fri Oct 4 19:32:33 BST 2013

On Fri, Oct 4, 2013 at 1:54 AM, Tim Watson <watson.timothy at gmail.com> wrote:

> > All items below were discovered while deploying 3.1.5 over the past few
> days. Hosts in question have 24 sandy bridge HT cores, 64GB of RAM, XFS
> filesystem, running on CentOS 6. Cluster is 5 nodes, with a default HA
> policy on all queues of exact/3/automatic-sync.
> >
>
> That's a very strong consistency and redundancy guarantee for every queue.
> Do you really need such strong guarantees for all of them? There is a cost
> to doing HA.
>

Yes, it's very important we never lose a message if it gets accepted for
delivery. We're more than willing to pay the overhead in terms of more
hardware as necessary. The 5 node count is just for initial deployment; I'd
expect us to double the cluster size every year for the next 3-5 years to
deal with our workload growth, even counting improvements in individual
server processing power.

>  > - expected queues to be distributed evenly among cluster machines,
> instead got all queues on first 3 machines in the cluster, nothing on the
> last 2.
>
> Distributed evenly in what regard? Randomly, or based on some metric?
>

Doesn't matter. Random or round robin would be sufficient. We use on the
order of hundreds of queues, and so even with ~10% having a somewhat higher
workload, any distribution scheme would balance the load out between
machines reasonably evenly.
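
As a rough illustration of the round-robin option, here is a minimal sketch of assigning queues to nodes on the client side. It is not something RabbitMQ provides; the function name, node names, and queue names are all hypothetical. Since a queue's master ends up on the node where the queue was created, a client using this mapping would connect to the chosen node before declaring the queue.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(queue_names, nodes):
    """Map each queue name to a node, cycling through the nodes in order."""
    if not nodes:
        raise ValueError("at least one node is required")
    node_cycle = cycle(nodes)
    return {name: next(node_cycle) for name in queue_names}


# Hypothetical node and queue names, for illustration only.
nodes = [f"rabbit@host{i}" for i in range(1, 6)]
queues = [f"queue-{i:03d}" for i in range(100)]

placement = assign_round_robin(queues, nodes)
per_node = Counter(placement.values())
for node in nodes:
    print(node, per_node[node])
```

With 100 queues and 5 nodes, each node ends up with 20 queue masters, so no node is left idle.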

 > - expected message reads from a mirror machine for a queue to do the
> read i/o locally, so as to spread out workload, but it appears to always go
> to the host where the queue was created.
>
> That's expected behaviour. In a master-slave configuration, writes have to
> go to the master. Odd though it may sound, reads from a queue involve
> writes, since we have to do accounting (of e.g.,  pending ACKs, position in
> the queue, etc), so all requests are handled by the master.
>

Yeah, I understand the logic around needing to do queue management and
requiring locks and writing. It just doesn't make sense to me that the read
can't happen locally if the data exists locally, after all appropriate
queue locks and bookkeeping have completed. I imagine this is just for code
simplicity rather than any technical limitation, and it's something that
really isn't an issue if we can evenly balance queues between cluster
hosts. I also imagine it isn't an issue for people who aren't trying to
send large, persistent, binary messages through the queueing system, since
they probably never run into IO limitations.

 > - this led to a single node with ~35k active open filehandles, and 4
> nodes with ~90. not an optimum distribution of read workload.
>
> Agreed. Simon or Matthias may be able to elaborate on various things we're
> working on to improve workload distribution.
>

Great! We're doing some work on our code to manually distribute queues at
creation time, but it'd be a lot better if there was a switch to pull on
the rabbit end to just make it happen.

>  > - expected that if the system a queue was created on is permanently removed
> (shut down and "rabbitmqctl forget_cluster_node hostname"'d), automatic
> sync would ensure there's the right number of copies replicated, but
> instead it just left every single queue under replicated.
>
> That doesn't sound right. It's not automatic sync we're talking here
> either - that sounds like the policy isn't getting applied properly.
>

Hmm... Well, we're just applying a global policy with the pattern ".*", and
it shows as being applied in the queue information API and on the web page.
I'm not sure how to check if it's fully applied otherwise, so if you've got
something I can run to check that, I can definitely do some digging.
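
One possible check, sketched under assumptions: take the queue list from the management plugin's queue listing (already decoded from JSON) and count copies per queue as one master plus its mirrors. The `name` and `slave_nodes` field names are assumed to match what the management API returns for this release and should be verified against a real response; the function name and sample data are hypothetical.

```python
def under_replicated(queues, expected_copies=3):
    """Return (queue name, copy count) for queues with too few copies.

    A queue's copy count is the master plus every listed mirror.
    The "slave_nodes" field name is an assumption about the API output.
    """
    result = []
    for queue in queues:
        copies = 1 + len(queue.get("slave_nodes", []))
        if copies < expected_copies:
            result.append((queue["name"], copies))
    return result


# Hypothetical sample shaped like a decoded queue listing.
sample = [
    {"name": "orders", "node": "rabbit@host1",
     "slave_nodes": ["rabbit@host2", "rabbit@host3"]},
    {"name": "invoices", "node": "rabbit@host2",
     "slave_nodes": ["rabbit@host3"]},
    {"name": "audit", "node": "rabbit@host3"},
]

for name, copies in under_replicated(sample):
    print(f"{name}: {copies} of 3 copies")
```

Running this against the live queue list after a node has been forgotten would show whether the exact/3 policy is restoring the missing copies or only being reported as applied.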

>  > - when a new policy is applied that defines specific replication nodes,
> or a number of copies using 'exact', and auto-sync is set, sometimes it just
> syncs the first replica and leaves any others unsynced and calls it job
> done. This is bad.
>
> Can you provide us with a way to reproduce this? How did you detect that
> the remaining replicas were not sync'ed?
>

Detection was just by looking at the queue page in the management web GUI.
It shows a big blue +1 and a big red +1 next to maybe 10% of queues after
applying the global queue policy after all sync ops complete. If I issue a
manual sync operation on all the problem queues, then they correctly finish
syncing up the 3rd data copy. I'll see if I can script up a way to
reproduce it on a clean set of nodes, since I'm trying not to break my prod
cluster any more than I have this week. I'll e-mail the list once I've got
a reproducible test case.
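
The same detection could be scripted rather than read off the web GUI. The sketch below assumes each queue entry from the management API carries `slave_nodes` and `synchronised_slave_nodes` lists (field names to be confirmed against a real response), and it only builds `rabbitmqctl sync_queue` command lines for the default vhost without running them. The helper names and sample data are hypothetical.

```python
def unsynced_mirrors(queue):
    """Return the mirrors of a queue that are not reported as synchronised."""
    synced = set(queue.get("synchronised_slave_nodes", []))
    return [node for node in queue.get("slave_nodes", []) if node not in synced]


def sync_commands(queues):
    """Build (but do not run) a manual sync command per problem queue."""
    return [
        ["rabbitmqctl", "sync_queue", queue["name"]]
        for queue in queues
        if unsynced_mirrors(queue)
    ]


# Hypothetical sample: "invoices" has one mirror that never finished syncing.
sample = [
    {"name": "orders",
     "slave_nodes": ["rabbit@host2", "rabbit@host3"],
     "synchronised_slave_nodes": ["rabbit@host2", "rabbit@host3"]},
    {"name": "invoices",
     "slave_nodes": ["rabbit@host2", "rabbit@host3"],
     "synchronised_slave_nodes": ["rabbit@host2"]},
]

for command in sync_commands(sample):
    print(" ".join(command))
```

Logging the output of `unsynced_mirrors` right after the automatic sync finishes would also give a record of which queues were affected, which is useful for a reproducible test case.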

>  > - Attempted to create small per-queue policies to redistribute messages
> and then delete the per-queue policies, but this often leads to an
> inconsistent cluster state where queues continued to show as being part of
> a policy that was already deleted, attempt to resync, and get stuck, unable
> to complete or switch back to the global default policy.
>
> Again, it would be helpful if you could help us to replicate this.
>

This is 100% reproducible on our prod cluster. I've got a python script
that attempts the rebalancing on a cluster, so I'll add some logic to get
it to generate and populate queues to reproduce this on a fresh cluster,
and e-mail that out.
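
For readers following along, the kind of per-queue policy involved can be sketched as below. This is not the script mentioned above, only a guess at the general shape: it builds a policy body that pins a single queue to an explicit node list using the `ha-mode`, `ha-params` and `ha-sync-mode` keys. The function name, queue name, node names and priority value are hypothetical, and how the body is submitted (management API or `rabbitmqctl set_policy`) is left out.

```python
import json
import re


def per_queue_policy(queue_name, target_nodes, priority=10):
    """Build a policy body that matches exactly one queue by name.

    The priority is assumed to be higher than the global ".*" policy's,
    so that this policy wins while it exists.
    """
    return {
        "pattern": "^" + re.escape(queue_name) + "$",
        "definition": {
            "ha-mode": "nodes",
            "ha-params": list(target_nodes),
            "ha-sync-mode": "automatic",
        },
        "priority": priority,
    }


# Hypothetical queue and node names.
policy = per_queue_policy(
    "orders.eu", ["rabbit@host4", "rabbit@host5", "rabbit@host1"]
)
print(json.dumps(policy, indent=2))
```

Anchoring and escaping the pattern matters here: an unanchored or unescaped name could match other queues and pull them under the temporary policy as well.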

>  > - sometimes the cluster refuses to accept any more policy commands.
> Have to fully shut down and restart the cluster to clear this condition.
>
> And this. Can you provide a run down of these policies and the order in
> which you're trying to apply them? Also, how busy are the queues whilst the
> policy changes are happening? We may need to extend our test beds to
> reliably reproduce such problems.
>

This case happens after attempting a bunch of policy operations from the
previously mentioned script, so it should be easy enough to see it in action
once I've got a script to reproduce the previous issue. We saw this
happening with as low as 5 messages/sec on the whole cluster, so it doesn't
seem to be load related.

> > - sometimes policies applied to empty and inactive queues don't get
> correctly applied, and the queue hangs on "resyncing / 100%".
>
> What!?
>

Yeah. That was my reaction as well. We saw this after removing the
per-queue policies created with the previously mentioned script, after the
queues reverted to the global exact/3/autosync policy. I had to actually
kill all of my rabbitmq instances as they wouldn't nicely shut down, and
then bring the whole cluster back up to clear this.

>  > this makes no sense, given the queue is empty, and requires a full
> cluster restart to clear.
>
> Please provide the commands you invoked to get this to happen.
>

Again, these are all things noticed after running the script mentioned above
to do the per-queue policies. I didn't intentionally do anything to make
these errors occur, but once I've got a script to reproduce the first set
of errors on a fresh cluster, it should be easy enough to see some of these
other issues, since they seem to cascade from the first set of problems.

>  > - I've managed to get the cluster into an inconsistent state a /lot/
> using the HA features, so it feels like they need more automated stress
> testing and bulletproofing.
>
> If you can help us reproduce these errors, I can assure you that they'll
> get included in our integration tests!
>

Great. I'll get to work on being able to solidly reproduce at least the
first set of issues I encountered, and hopefully that'll lead to a
reproduction path for some of the other ones.

>  > Persistent message storage:
> >
> > - it appears as if messages are put into very small batch files on the
> filesystem (1-20 MB)
> > - this causes the filesystem to thrash if your IO isn't good at random
> IO (SATA disks) and you have lots of persistent messages (>200k messages
> 500kB-1MB in size) that don't fit in RAM.
> > - this caused CentOS 6 kernel to kill erlang after stalling the XFS
> filesystem for > 120s.
>
> Iirc this is tuneable, though we don't recommend changing it. Not at my
> desk right now though, so I can't remember the exact details.
>

It doesn't seem to be an issue since we've switched to SSDs, so I'm not
going to spend a lot of time worrying about it. It'd just be nice to see
some supported tuning options for this make a dev roadmap for the future.
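
To put rough numbers on the workload described in the quoted list, here is a back-of-the-envelope calculation using only the figures above. It assumes decimal units, treats ">200k messages" as exactly 200,000 (a lower bound), and ignores per-message overhead, so the real figures would be somewhat higher.

```python
MESSAGES = 200_000              # lower bound from ">200k messages"
MIN_MSG_BYTES = 500 * 10**3     # 500 kB
MAX_MSG_BYTES = 1 * 10**6       # 1 MB
RAM_BYTES = 64 * 10**9          # 64 GB per host
MIN_FILE_BYTES = 1 * 10**6      # smallest observed store file, 1 MB
MAX_FILE_BYTES = 20 * 10**6     # largest observed store file, 20 MB

low_total = MESSAGES * MIN_MSG_BYTES
high_total = MESSAGES * MAX_MSG_BYTES

# Fewest files: smallest data set packed into the largest files.
fewest_files = low_total // MAX_FILE_BYTES
# Most files: largest data set spread over the smallest files.
most_files = high_total // MIN_FILE_BYTES

print(f"backlog: {low_total / 10**9:.0f}-{high_total / 10**9:.0f} GB")
print(f"exceeds RAM: {low_total > RAM_BYTES}")
print(f"store files: {fewest_files}-{most_files}")
```

Even at the low end the backlog is about 100 GB, well beyond 64 GB of RAM, and it is spread over at least 5,000 store files, which is consistent with the random-IO pressure seen on SATA disks.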

>  > - if a node crashes, Rabbit seems to rescan the entire on-disk
> datastore before continuing, instead of using some sort of checkpointing or
> journaling system to quickly recover from a crash.
> > - all of above should be solvable by using an existing append-only
> datastore like eLevelDB or Bitcask.
>
> On our todo list already, at least for the message store index.
>

Great, glad to hear it.

There are probably a lot of performance improvements to be had by using
something like eleveldb or bitcask, since they do a lot to optimize disk
seeks and RAM buffering, but I imagine that's a fairly ambitious amount of
work that isn't particularly high priority for you guys. Just a
suggestion to think about in the long term, or to put an intern on testing.
;)

In any case, I'm going to get working on reproducible test cases for all
the issues we've been discussing. I'll update this thread when I have
something concrete for you.

Graeme