[rabbitmq-discuss] Feature Req / Bug list

Mon Nov 4 20:29:32 GMT 2013

Ok, so here are the issues I'm still seeing with the latest nightly build
(rabbitmq-server-3.2.0.41104-1.noarch):

- multiple simultaneous API calls from the CLI (running rabbitmqctl and
amqp-delete-queue in parallel from bash) still cause weird buggy responses
from RMQ. Running my setup_queues.sh multiple times is enough to trigger
it. It seems to be primarily the operations in the first loop that have
problems; the second loop (which creates the queues) seems to run fine in
parallel, which is good.
- this is also the case with my rebalance_cluster.sh script: when run in
parallel, the API handler breaks and skips applying about half of the new
policies.
- bug where policies with > 2 replicas don't get properly auto-replicated
is still present. Running rebalance_cluster.sh multiple times causes many
queues to end up under-replicated after the first pass. However, there
seems to be a workaround: if I apply a global replication policy of
"exactly 3", and then delete the per-queue policies, this seems to force a
resync as it switches queues to the global policy, and so they end up fully
replicated to the 3 nodes I specified with the per-queue policies.
- still lots of replication problems related to pulling and re-adding
nodes. Running the toggle_nodes.sh script in a loop with the global
"exactly 3" policy demonstrates that as nodes are removed, RMQ doesn't
automatically replicate to a different node; it just leaves queues with
only 1 replica, and doesn't always re-replicate back to the node which was
pulled after it's been re-added. After a few iterations (3-5), this causes
data loss on queues, with queues that have sequentially lost all their
replicas ending up empty, even though the global policy says they should be
ensuring there are always 3 copies of the data.
- running toggle_nodes.sh in a loop will also still get to the point where
rabbitmq refuses to do any further node leave/join operations until the
cluster is fully killed and brought back up.
- this also eventually causes a few of the queues to enter the weird state
where they list unsynchronized mirrors, but aren't attempting to sync. At
this point, rebalance_cluster.sh will also fail, and the cluster will
reject all policy operation requests until it's fully killed and restarted.
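
For reference, the "global policy, then delete the per-queue policies"
workaround described above can be sketched roughly as follows. This is a
minimal sketch, not the actual rebalance_cluster.sh: the policy names, the
catch-all pattern and the RABBITMQCTL override variable are all assumptions
made for illustration, and it assumes the per-queue policies take precedence
over the global one until they are cleared.

```shell
# RABBITMQCTL may be overridden (e.g. with "echo" for a dry run);
# this variable is an assumption of the sketch, not part of RabbitMQ.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Apply a catch-all "exactly 3" mirroring policy, then clear the named
# per-queue policies so those queues fall back to the global policy.
#   $1     = name for the global policy (hypothetical)
#   $2...  = names of the per-queue policies to clear (hypothetical)
switch_to_global_policy() {
    global_name=$1
    shift
    # '.*' matches every queue name; the JSON asks for exactly 3 replicas.
    $RABBITMQCTL set_policy "$global_name" '.*' \
        '{"ha-mode":"exactly","ha-params":3}' || return 1
    for policy in "$@"; do
        $RABBITMQCTL clear_policy "$policy" || return 1
    done
}
```

Running it with RABBITMQCTL=echo prints the commands instead of executing
them, e.g. `switch_to_global_policy ha-global q1-policy q2-policy`.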

On the bright side:

- loading 10k messages per queue seems faster on my VM than with the 3.2.0
release, even when doing 3x local replication.
- bug where it was deleting queues as they approached 10k messages when
policies had been applied seems to be gone.
- cluster management functions run sequentially from scripts seem to be
safe now; I haven't been able to generate any bad behaviour as long as I'm
not running more than one cluster command at a time, and as long as I'm not
removing and adding nodes.
- have a known good way to rebalance queues across cluster systems, even if
it is slow and requires a workaround to get fully synced.
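
Since running one cluster command at a time appears to be safe, one way to
enforce that from several scripts at once is a simple lock wrapper. The
following is only a sketch under stated assumptions: the lock path and the
run_serialized helper are hypothetical names, and it relies on mkdir being
atomic rather than on anything RabbitMQ provides.

```shell
# Hypothetical lock location; mkdir either creates it or fails, atomically,
# so only one caller at a time gets past the loop below.
LOCK_DIR="${LOCK_DIR:-/tmp/rmq_cluster_cmd.lock}"

# Run the given command while holding the lock, then release it and
# pass the command's exit status back to the caller.
run_serialized() {
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        sleep 1   # another command holds the lock; wait and retry
    done
    "$@"
    rc=$?
    rmdir "$LOCK_DIR"
    return "$rc"
}
```

Usage would look like `run_serialized rabbitmqctl list_policies`. Note that
if a script is killed while holding the lock, the directory is left behind
and has to be removed by hand.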

I'm really hoping you guys can replicate the more severe errors around
policies and node management. I'm happy to provide thorough reproduction
steps, information, logs, etc for anything I've reported here, if you guys
are having issues reproducing internally.

Graeme

On Mon, Nov 4, 2013 at 11:00 AM, Graeme N <graeme at sudo.ca> wrote:

> Hi Simon,
>
> It does seem likely to me that fixing the initial issues I found would
> resolve the other problems, so I'll definitely re-test within the next day
> or so using the latest nightly build, and let you know how it goes.
>
> Thanks!
> Graeme
>
>
> On Mon, Nov 4, 2013 at 6:05 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
>
>> On 29/10/2013 12:06AM, Graeme N wrote:
>>
>>> These errors are all really strange, and so I'm hoping you guys can
>>> reproduce them, and best case scenario, find something that accounts for
>>> these problems which we can then patch in our production environment.
>>>
>>
>> We haven't been able to recreate your most recent issues - but it's
>> possible that they're all copies of the same underlying problem as the
>> first lot you reported (which are now fixed).
>>
>> So would you be able to have another go, with a recent nightly build?
>>
>> http://www.rabbitmq.com/nightly-builds.html
>>
>>
>> Cheers, Simon
>>
>> --
>> Simon MacMullen
>> RabbitMQ, Pivotal
>>
>
>