<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Oct 4, 2013 at 1:54 AM, Tim Watson <span dir="ltr">&lt;<a href="mailto:watson.timothy@gmail.com" target="_blank">watson.timothy@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">&gt; All items below were discovered while deploying 3.1.5 over the past few days. Hosts in question have 24 sandy bridge HT cores, 64GB of RAM, XFS filesystem, running on CentOS 6. Cluster is 5 nodes, with a default HA policy on all queues of exact/3/automatic-sync.<br>


&gt;<br>

<br>

</div>That&#39;s a very strong consistency and redundancy guarantee for every queue. Do you really need such strong guarantees for all of them? There is a cost to doing HA.<br></blockquote><div><br></div><div>Yes, it&#39;s very important we never lose a message if it gets accepted for delivery. We&#39;re more than willing to pay the overhead in terms of more hardware as necessary. The 5 node count is just for initial deployment; I&#39;d expect us to double the cluster size every year for the next 3-5 years to deal with our workload growth, even counting improvements in individual server processing power.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - expected queues to be distributed evenly among cluster machines, instead got all queues on first 3 machines in the cluster, nothing on the last 2.<br>

<br>

</div>Distributed evenly in what regard? Randomly, or based on some metric?<br></blockquote><div><br></div><div>Doesn&#39;t matter. Random or round robin would be sufficient. We use in the order of 100s of queues, and so even with ~10% having a somewhat higher workload, any distribution scheme would balance the load out between machines reasonably evenly.<br>
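<br>
As a rough illustration of the round-robin option, here is a minimal sketch. It only computes a placement; it does not talk to RabbitMQ. The helper name and the node and queue names are made up for the example.<br>
<pre>
```python
from itertools import cycle


def assign_round_robin(queue_names, nodes):
    """Map each queue name to a home node, cycling through the nodes in order."""
    node_cycle = cycle(nodes)
    # Sort the names so the placement is the same on every run.
    return {name: next(node_cycle) for name in sorted(queue_names)}


# Hypothetical node and queue names, purely for illustration.
nodes = ["rabbit@node%d" % i for i in range(1, 6)]
queues = ["queue-%03d" % i for i in range(100)]
placement = assign_round_robin(queues, nodes)
print(placement["queue-000"], placement["queue-005"])
```
</pre>
With 100 queues and 5 nodes this puts 20 queues on each node, which is all the balancing described above needs.<br>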

<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - expected message reads from a mirror machine for a queue to do the read i/o locally, so as to spread out workload, but it appears to always go to the host where the queue was created.<br>

<br>

</div>That&#39;s expected behaviour. In a master-slave configuration, writes have to go to the master. Odd though it may sound, reads from a queue involve writes, since we have to do accounting (of e.g.,  pending ACKs, position in the queue, etc), so all requests are handled by the master.<br>

</blockquote><div><br></div><div>Yeah, I understand the logic around needing to do queue management and requiring locks and writing. It just doesn&#39;t make sense to me that the read can&#39;t happen locally if the data exists locally, after all appropriate queue locks and bookkeeping have completed. I imagine this is just for code simplicity rather than any technical limitation, and it&#39;s something that really isn&#39;t an issue if we can evenly balance queues between cluster hosts. I also imagine it isn&#39;t an issue for people who aren&#39;t trying to send large, persistent, binary messages through the queueing system, since they probably never run into IO limitations.<br>

<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - this led to a single node with ~35k active open filehandles, and 4 nodes with ~90. not an optimum distribution of read workload.<br>

<br>

</div>Agreed. Simon or Matthias may be able to elaborate on various things we&#39;re working on to improve workload distribution.<br></blockquote><div><br></div><div>Great! We&#39;re doing some work on our code to manually distribute queues at creation time, but it&#39;d be a lot better if there was a switch to pull on the rabbit end to just make it happen.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - expected that if the system a queue was created on is permanently removed (shut down and &quot;rabbitmqctl forget_cluster_node hostname&quot;&#39;d), automatic sync would ensure there&#39;s the right number of copies replicated, but instead it just left every single queue under-replicated.<br>


<br>

</div>That doesn&#39;t sound right. It&#39;s not automatic sync we&#39;re talking about here either - that sounds like the policy isn&#39;t getting applied properly.<br></blockquote><div><br></div><div>Hmm... Well, we&#39;re just applying a global policy with the pattern &quot;.*&quot;, and it shows as being applied in the queue information API and on the web page. I&#39;m not sure how to check if it&#39;s fully applied otherwise, so if you&#39;ve got something I can run to check that, I can definitely do some digging.<br>
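<br>
One possible check, sketched under assumptions: count the copies each queue reports and flag any queue that does not have the 3 copies the policy asks for. The field names &quot;node&quot; and &quot;slave_nodes&quot; are assumed to match what the management API queue listing returns on this version, so verify them against real output. The sample data is hand-written, not taken from a cluster.<br>
<pre>
```python
def replica_count_mismatches(queues, expected_copies=3):
    """Return (queue name, copies found) for queues whose copy count differs.

    Assumption: each dict has "name", "node" (the master) and "slave_nodes"
    (the mirrors), shaped like one entry of the management API queue listing.
    """
    mismatches = []
    for queue in queues:
        copies = 1 + len(queue.get("slave_nodes", []))  # master plus mirrors
        if copies != expected_copies:
            mismatches.append((queue["name"], copies))
    return mismatches


# Hand-written sample data, not output from a real cluster.
sample = [
    {"name": "q1", "node": "rabbit@node1",
     "slave_nodes": ["rabbit@node2", "rabbit@node3"]},
    {"name": "q2", "node": "rabbit@node2", "slave_nodes": ["rabbit@node3"]},
]
print(replica_count_mismatches(sample))
```
</pre>
Here q2 is reported with 2 copies, which is the under-replicated state described above.<br>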

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - when a new policy is applied that defines specific replication nodes, or a number of copies using &#39;exact&#39;, and auto-sync is set, sometimes it just syncs the first replica and leaves any others unsynced and calls it job done. This is bad.<br>


<br>

</div>Can you provide us with a way to reproduce this? How did you detect that the remaining replicas were not sync&#39;ed?<br></blockquote><div><br></div><div>Detection was just by looking at the queue page in the management web GUI. It shows a big blue +1 and a big red +1 next to maybe 10% of queues after applying the global queue policy after all sync ops complete. If I issue a manual sync operation on all the problem queues, then they correctly finish syncing up the 3rd data copy. I&#39;ll see if I can script up a way to reproduce it on a clean set of nodes, since I&#39;m trying not to break my prod cluster any more than I have this week. I&#39;ll e-mail the list once I&#39;ve got a reproducible test case.<br>
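<br>
The same detection could be scripted rather than read off the web page. This is a sketch only: it assumes the queue listing exposes &quot;slave_nodes&quot; and &quot;synchronised_slave_nodes&quot;, and that &quot;rabbitmqctl sync_queue&quot; is the manual sync command on this version. Check both against your install. The sample data is invented.<br>
<pre>
```python
def unsynced_mirrors(queue):
    """Return the mirrors of a queue that are not reported as synchronised."""
    # Assumed field names; verify against real management API output.
    mirrors = set(queue.get("slave_nodes", []))
    synced = set(queue.get("synchronised_slave_nodes", []))
    return sorted(mirrors - synced)


def manual_sync_commands(queues):
    """Build one argument list per queue that still has an unsynced mirror."""
    return [["rabbitmqctl", "sync_queue", q["name"]]
            for q in queues if unsynced_mirrors(q)]


# Invented sample: q2 has one mirror that never finished syncing.
sample = [
    {"name": "q1", "slave_nodes": ["rabbit@node2", "rabbit@node3"],
     "synchronised_slave_nodes": ["rabbit@node2", "rabbit@node3"]},
    {"name": "q2", "slave_nodes": ["rabbit@node2", "rabbit@node3"],
     "synchronised_slave_nodes": ["rabbit@node2"]},
]
print(manual_sync_commands(sample))
```
</pre>
The commands are returned as argument lists so they can be handed to subprocess without shell quoting.<br>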

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - Attempted to create small per-queue policies to redistribute messages and then delete the per-queue policies, but this often leads to an inconsistent cluster state where queues continued to show as being part of a policy that was already deleted, attempt to resync, and get stuck, unable to complete or switch back to the global default policy.<br>


<br>

</div>Again, it would be helpful if you could help us to replicate this.<br></blockquote><div><br></div><div>This is 100% reproducible on our prod cluster. I&#39;ve got a python script that attempts the rebalancing on a cluster, so I&#39;ll add some logic to get it to generate and populate queues to reproduce this on a fresh cluster, and e-mail that out.<br>
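<br>
For readers following along, a per-queue policy of the kind discussed here might be built as below. This is an illustrative sketch, not the script mentioned above: the helper name, the &quot;pin-&quot; policy prefix, and the queue and node names are all hypothetical. The &quot;ha-mode&quot;, &quot;ha-params&quot; and &quot;ha-sync-mode&quot; keys are the mirrored-queue policy keys.<br>
<pre>
```python
import json
import re


def per_queue_policy(queue_name, target_nodes):
    """Build (policy name, pattern, definition JSON) pinning one queue to nodes."""
    # Anchor and escape the name so the pattern matches this queue only.
    pattern = "^%s$" % re.escape(queue_name)
    definition = {
        "ha-mode": "nodes",
        "ha-params": list(target_nodes),
        "ha-sync-mode": "automatic",
    }
    return "pin-" + queue_name, pattern, json.dumps(definition, sort_keys=True)


# Hypothetical queue and node names.
name, pattern, definition = per_queue_policy(
    "orders.eu", ["rabbit@node4", "rabbit@node5"])
print(name, pattern, definition)
```
</pre>
Such a policy would also need a higher priority than the global &quot;.*&quot; policy to take effect, and deleting it is what should return the queue to the global default.<br>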

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - sometimes the cluster refuses to accept any more policy commands. Have to fully shut down and restart the cluster to clear this condition.<br>

<br>

</div>And this. Can you provide a run down of these policies and the order in which you&#39;re trying to apply them? Also, how busy are the queues whilst the policy changes are happening? We may need to extend our test beds to reliably reproduce such problems.<br>

</blockquote><div><br></div><div>This case happens after attempting a bunch of policy operations from the previously mentioned script, so it should be easy enough to see it in action once I&#39;ve got a script to reproduce the previous issue. We saw this happening with as low as 5 messages/sec on the whole cluster, so it doesn&#39;t seem to be load related.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


&gt; - sometimes policies applied to empty and inactive queues don&#39;t get correctly applied, and the queue hangs on &quot;resyncing / 100%&quot;.<br>

<br>

What!?<br></blockquote><div><br></div><div>Yeah. That was my reaction as well. We saw this after removing the per-queue policies created with the previously mentioned script, after the queues reverted to the global exact/3/autosync policy. I had to actually kill all of my rabbitmq instances as they wouldn&#39;t nicely shut down, and then bring the whole cluster back up to clear this.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; this makes no sense, given the queue is empty, and requires a full cluster restart to clear.<br>

<br>

</div>Please provide the commands you invoked to get this to happen.<br></blockquote><div><br></div><div>Again, these are all things noticed after running the script mentioned above to do the per-queue policies. I didn&#39;t intentionally do anything to make these errors occur, but once I&#39;ve got a script to reproduce the first set of errors on a fresh cluster, it should be easy enough to see some of these other issues, since they seem to cascade from the first set of problems.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - I&#39;ve managed to get the cluster into an inconsistent state a /lot/ using the HA features, so it feels like they need more automated stress testing and bulletproofing.<br>

<br>

</div>If you can help us reproduce these errors, I can assure you that they&#39;ll get included in our integration tests!<br></blockquote><div><br></div><div>Great. I&#39;ll get to work on being able to solidly reproduce at least the first set of issues I encountered, and hopefully that&#39;ll lead to a reproduction path for some of the other ones.<br>

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; Persistent message storage:<br>

&gt;<br>

&gt; - it appears as if messages are put into very small batch files on the filesystem (1-20 MB)<br>

&gt; - this causes the filesystem to thrash if your IO isn&#39;t good at random IO (SATA disks) and you have lots of persistent messages (&gt;200k messages 500kB-1MB in size) that don&#39;t fit in RAM.<br>

&gt; - this caused CentOS 6 kernel to kill erlang after stalling the XFS filesystem for &gt; 120s.<br>

<br>

</div>Iirc this is tuneable, though we don&#39;t recommend changing it. Not at my desk right now though, so I can&#39;t remember the exact details.<br></blockquote><div><br></div><div>It doesn&#39;t seem to be an issue since we&#39;ve switched to SSDs, so I&#39;m not going to spend a lot of time worrying about it. It&#39;d just be nice to see some supported tuning options for this make a dev roadmap for the future.<br>
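<br>
For scale, a back-of-the-envelope calculation using only the figures quoted above (200k messages of 500kB to 1MB, hosts with 64GB of RAM). It uses decimal units throughout, and since the quoted count is a lower bound, so are the results.<br>
<pre>
```python
GB = 10 ** 9


def backlog_gb(message_count, message_bytes):
    """Total size of a backlog in decimal gigabytes (rounded down)."""
    return message_count * message_bytes // GB


low = backlog_gb(200000, 500 * 10 ** 3)   # 500 kB per message
high = backlog_gb(200000, 10 ** 6)        # 1 MB per message
ram_gb = 64                               # RAM per host, from the description

print(low, high)                          # backlog range in GB
print(low / float(ram_gb), high / float(ram_gb))  # multiples of one host's RAM
```
</pre>
That works out to 100 to 200 GB of persistent messages, roughly 1.6 to 3.1 times the RAM of a single host, which is why the backlog ends up being served from disk.<br>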

</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">

&gt; - if a node crashes, Rabbit seems to rescan the entire on-disk datastore before continuing, instead of using some sort of checkpointing or journaling system to quickly recover from a crash.<br>

&gt; - all of above should be solvable by using an existing append-only datastore like eLevelDB or Bitcask.<br>

<br>

</div>On our todo list already, at least for the message store index.<br></blockquote><div><br></div><div>Great, glad to hear it.<br><br>There are probably a lot of performance improvements to be had by using something like eleveldb or bitcask, since they do a lot to optimize disk seeks and RAM buffering, but I imagine that&#39;s a fairly ambitious amount of work that isn&#39;t a particularly high priority for you guys. Just a suggestion to think about in the long term, or to put an intern on testing. ;)<br>

<br></div><div>In any case, I&#39;m going to get working on reproducible test cases for all the issues we&#39;ve been discussing. I&#39;ll update this thread when I have something concrete for you.<br></div><div><br></div>

<div>Graeme<br><br></div></div></div></div>