[rabbitmq-discuss] HA - missing or incompletely replicated queues

Mon Nov 7 14:15:23 GMT 2011

>
> You're aware that there is no eager synchronisation of HA queues, yes?
> So it's only by the unsynchronised head of each queue being consumed
> that synchronisation occurs.

Yes, I allow this to happen.

> Did the previously downed node really come back up and join the cluster
> correctly? Does the output of rabbitmqctl cluster_status on each of the
> 3 nodes report all 3 nodes are running?

I couldn't tell you that without starting the tests again, but the
management plugin reports they are all up, and producers + consumers
reconnect to the downed node once it has come back up without problems.
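
A minimal sketch of how that per-node check could be scripted, assuming the cluster_status output contains an Erlang term of the form {running_nodes,[...]}. The node names and sample text are hypothetical, and the helper names are illustrative only; the text for each node would come from running rabbitmqctl cluster_status against it.

```python
import re

# Assumption: cluster_status prints an Erlang term containing
# {running_nodes,[node1,node2,...]}; the list may wrap across lines.
RUNNING_RE = re.compile(r"\{running_nodes,\s*\[([^\]]*)\]\}")


def running_nodes(cluster_status_output):
    """Extract the set of running node names from cluster_status text."""
    match = RUNNING_RE.search(cluster_status_output)
    if match is None:
        return set()
    # Quoted atoms such as 'rabbit@host-1' lose their quotes here.
    return {n.strip().strip("'") for n in match.group(1).split(",") if n.strip()}


def disagreeing_nodes(outputs, expected):
    """Given {node: cluster_status text}, return nodes whose view of the
    running set differs from the expected set of nodes."""
    expected = set(expected)
    return sorted(
        node for node, text in outputs.items()
        if running_nodes(text) != expected
    )
```

Any node returned by disagreeing_nodes either did not rejoin correctly or cannot see one of its peers, which is the situation the question above is trying to rule out.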

> > I've also found that sometimes queues stop delivering messages when
> > certain nodes go down (even after being left for minutes), despite
> > being in HA mode (haven't been able to dig into this more yet).
>
> When this happens, could you check the logs please on all nodes for any
> entries regarding the queues. If a node with a queue master goes down
> then there should be entries about some slave on another node being
> promoted, but even if it's just a slave that dies, there should be
> entries in the logs that show others have noticed that.

Yes, will look for that.

> > Sometimes connections to nodes which have gone down are still shown
> > and get stuck.
>
> Interesting. That might be a bug in the mgmt plugin. Does rabbitmqctl
> list_connections also show such phantom connections?

Will check next time I see it - running some more tests this PM.

> > Today, while bringing up the cluster from scratch (shutdown all
> > instances, wipe mnesia, restart) I've got 3 nodes running, but an HA
> > queue with 1 master, 2 synced slaves and 1 unsynced slave. Other queues
> > are showing 1 master and 2 synced slaves as expected. (see
> > http://www.evernote.com/shard/s53/sh/b6345885-88d1-4d21-9614-24abda75a1cb/c2a0dd265b39d21f3e8c336c67ced979
> > )
>
> Well, drain the "unsynced" queue and it'll become synced.

My issue here is that this one queue claims to have more copies than there
are nodes in the cluster. I tried to check whether this is a management
plugin bug by running rabbitmqctl list_queues name slave_pids
synchronised_slave_pids, but I can't get the cluster to list the queues at
all: the command just sits there (it has been hanging for the last 15
minutes).
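
A rough sketch of how this check could be automated, assuming list_queues prints one tab-separated row per queue with the pid lists rendered as [<...>, <...>]. That output format, the helper names and the 60-second timeout are assumptions, not confirmed behaviour. The timeout makes a hung rabbitmqctl raise an error instead of blocking indefinitely.

```python
import re
import subprocess

# Assumption: each pid is printed between angle brackets, e.g. <rabbit@host.1.234.0>.
PID_RE = re.compile(r"<[^<>]+>")


def count_copies(list_queues_output):
    """Return {queue_name: (slave_count, synced_slave_count)}."""
    counts = {}
    for line in list_queues_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip banner lines such as "Listing queues ..."
        name, slaves, synced = fields
        counts[name] = (len(PID_RE.findall(slaves)), len(PID_RE.findall(synced)))
    return counts


def overreplicated(counts, node_count):
    """Queues whose master plus slaves exceed the number of cluster nodes."""
    return sorted(
        name for name, (slaves, _synced) in counts.items()
        if 1 + slaves > node_count
    )


def list_queues(timeout_s=60):
    """Run rabbitmqctl; raises subprocess.TimeoutExpired if it hangs."""
    proc = subprocess.run(
        ["rabbitmqctl", "list_queues", "name", "slave_pids",
         "synchronised_slave_pids"],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return proc.stdout
```

With 3 nodes, a queue reporting 1 master and 3 slaves (2 synced, 1 unsynced) would be flagged by overreplicated(count_copies(list_queues()), 3).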

A