<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div>More fun with RabbitMQ clustering!<br><br></div>These next couple I wouldn't believe if I couldn't consistently reproduce them. Attached is a new script package which includes some updates to previous scripts, as well as the go based queue populator.<br>

First is an issue where, if you apply a global policy and then populate a bunch of queues, RabbitMQ removes about 3/4 of them once the populating finishes. It is baffling; I've attached screenshots, since I can hardly believe it myself. To reproduce:

- ./create_cluster.sh && ./setup_queues.sh && RABBITMQ_NODENAME="rabbit1@localhost" rabbitmqctl set_policy --priority 0 global_pol ".*" '{"ha-mode": "exactly", "ha-params": 3, "ha-sync-mode": "automatic"}' && sleep 5 && ./populate_queues.sh
- Watch the cluster admin queues page (http://localhost:4441/#/queues): all the queues fill up with 10000 messages, and then about 2/3 of them disappear once most of the queues are full.
- This happens both in my test VM and on a bare-metal server with 64GB of RAM and 24 cores.
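
If it's easier than watching the management UI, the same drop shows up from the command line. Something along these lines (node name as in the scripts; the output includes a header line or two, so treat it as a rough queue count) falls off a cliff shortly after the populate finishes:

  watch 'RABBITMQ_NODENAME="rabbit1@localhost" rabbitmqctl list_queues name messages | wc -l'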

Next is an issue where removing and re-adding nodes breaks RabbitMQ clustering. We keep running into this in prod: we'll attempt to adjust our cluster topology, things will break, and we have to take the whole cluster down and bring it back up again to fix it. To reproduce:

- ./create_cluster.sh && ./setup_queues.sh && ./populate_queues.sh && RABBITMQ_NODENAME="rabbit1@localhost" rabbitmqctl set_policy --priority 0 global_pol ".*" '{"ha-mode": "exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'
- Watch the cluster admin pages and wait until all messages are populated and the queues are synced. Note that since we applied the policy after populating the queues, this doesn't cause queues to be removed like in the previous case, for whatever reason.
- ./toggle_nodes.sh
- Watch nodes be removed and re-added. It should only take ~5 of these full cycles before the script loop hangs and never returns from the cluster operation it's attempting to perform (the kind of cycle the script runs per node is sketched after this list). If you Ctrl-C the script and run it again, it just hangs and refuses to perform any more cluster join/leave operations on any node.
- It's also likely you'll see queues where one of the mirrors isn't correctly synced, or is partially synced but stuck and never finishes syncing; this is likely related to the policy bugs I reported previously.
- Queues in these states often don't accept new messages for delivery, stalling message processing.
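
For reference, the kind of join/leave cycle toggle_nodes.sh runs against each node is roughly the following. This is only a sketch; the attached script is authoritative, and rabbit2@localhost stands in for whichever node is being cycled:

  # take the node out of the cluster
  RABBITMQ_NODENAME="rabbit2@localhost" rabbitmqctl stop_app
  RABBITMQ_NODENAME="rabbit2@localhost" rabbitmqctl reset
  # ...and join it back to rabbit1
  RABBITMQ_NODENAME="rabbit2@localhost" rabbitmqctl join_cluster rabbit1@localhost
  RABBITMQ_NODENAME="rabbit2@localhost" rabbitmqctl start_app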

We've found that once the cluster is in this state it behaves really oddly and needs to be fully shut down (or "killall beam.smp") and then brought back up before it behaves normally again. We had an incident last Friday, after adding a single node, where ~4 queues stopped accepting new messages and held up our entire workload until the whole cluster was shut down and brought back up.

These errors are all really strange, so I'm hoping you guys can reproduce them and, best case, find something that accounts for these problems which we can then patch in our production environment.

Thanks!
Graeme