[rabbitmq-discuss] Fw: high availability solutions?

Fri Jun 17 09:45:02 BST 2011

Hi Jason,

On Thu, Jun 16, 2011 at 12:24:27PM -0600, Jason J. W. Williams wrote:
> * The fact that messages in a transaction published to a mirrored
> queue are silently dropped is inevitably setting up a customer to lose
> messages when they forget about that property. Why not error out the
> transaction so the problem is actively in their face, or at least give
> it a best effort?

I wouldn't worry - we'll likely be dropping support for transactions
completely soon so any attempt to use tx.* will result in at least a
channel error.

> * If publishes are parallel to the master queue and its slaves, how
> are consumes and other actions that are slaved serially off the master
> able to keep the slaves in sync since you're mixing parallel ops with
> serial ones?

Consumers are never aware of the slaves. They talk only to the master,
using Erlang's normal inter-node messaging.

Publishes do go in parallel to all mirrors, but that introduces race
conditions, so the master informs the slaves of the order in which the
publishes should be accepted. The _only_ reason that publishes go to
slaves as well as the master from publishing channels is so that in the
event of the death of the master, there is no window in which msgs can
be lost.
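
To illustrate the ordering point, here is a minimal Python sketch of one
mirror. It is a toy model under my own assumptions, not Rabbit's actual
code: the class name MirrorSketch, its methods and the message ids are all
hypothetical. The mirror holds publishes that arrive straight from channels
and only accepts them into the queue in the order the master dictates.

```python
from collections import deque


class MirrorSketch:
    """Toy model of one mirror (hypothetical, not RabbitMQ's implementation)."""

    def __init__(self):
        self.pending = {}      # msg_id -> body, received from channels, not yet ordered
        self.order = deque()   # msg_ids in the order dictated by the master
        self.queue = []        # messages accepted into the mirrored queue

    def on_publish(self, msg_id, body):
        # Arrives directly from a publishing channel, possibly out of order.
        self.pending[msg_id] = body
        self._drain()

    def on_master_order(self, msg_id):
        # Arrives from the master and fixes the position of msg_id.
        self.order.append(msg_id)
        self._drain()

    def _drain(self):
        # Accept messages strictly in the master's order, and only once
        # the corresponding body has actually reached this mirror.
        while self.order and self.order[0] in self.pending:
            msg_id = self.order.popleft()
            self.queue.append((msg_id, self.pending.pop(msg_id)))


mirror = MirrorSketch()
mirror.on_publish("b", "second")   # this publish races ahead of "a"
mirror.on_master_order("a")        # the master saw "a" first
mirror.on_publish("a", "first")
mirror.on_master_order("b")
print(mirror.queue)
```

In this model, anything still sitting in `pending` when the master dies is
a message the mirror already holds even though the master never got round
to ordering it, which is the window the direct publishes are meant to close.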

> * If consumers are going to be considered disconnected, why aren't
> they actually disconnected so they can use their assumptions about
> lost connections to assume acknowledgements in flight were lost etc
> (and re-use app logic they already have)?

It's far better to use the consumer cancel notification that we added -
much lighter-weight. If we closed the channel, then there would be other
consequences - for example, that channel could be consuming from dozens
of queues. By closing it, all outstanding non-acked msgs from all those
other queues would now have to be needlessly requeued.
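
As a rough illustration of the difference, here is a small Python model of a
channel consuming from several queues. It is only a sketch under my own
assumptions: ChannelSketch, its methods and the queue names are hypothetical
and do not correspond to any client library's API. It counts how many
outstanding messages are touched by cancelling one consumer versus closing
the whole channel.

```python
class ChannelSketch:
    """Toy model of a channel consuming from several queues (hypothetical)."""

    def __init__(self):
        self.unacked = {}   # queue name -> delivered but not yet acked messages

    def deliver(self, queue, msg):
        self.unacked.setdefault(queue, []).append(msg)

    def cancel_consumer(self, queue):
        # Cancel notification: only the consumer on the affected queue goes away.
        return {queue: self.unacked.pop(queue, [])}

    def close(self):
        # Closing the channel: outstanding messages on every queue are affected.
        affected, self.unacked = self.unacked, {}
        return affected


def build_channel():
    ch = ChannelSketch()
    for q in ("orders", "billing", "audit"):
        for i in range(3):
            ch.deliver(q, f"{q}-{i}")
    return ch


after_cancel = build_channel().cancel_consumer("orders")
after_close = build_channel().close()
touched_by_cancel = sum(len(msgs) for msgs in after_cancel.values())
touched_by_close = sum(len(msgs) for msgs in after_close.values())
print(touched_by_cancel, touched_by_close)
```

With three queues and three unacked msgs each, the cancel touches only the
one queue's msgs, whereas the close touches all of them.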

> * The lack of synchronization for new Q slaves is a bummer,
> particularly in the case where a rejoining slave tosses its durable
> messages regardless of whether those messages have actually been
> consumed. It seems like there should be a history log shared across
> the masters and slaves so a rejoining slave can sync up off the master
> and reconcile its durable contents.

That'll be fixed eventually.

> * Why do mirror slaves have to be specified by node name, rather than
> just passing a "numCopies=X" to the queue on creation and letting the
> cluster handle the distribution? It seems to be shifting complexity to
> the client that should be in the cluster which has the best knowledge
> of how many members exist and other internals.

We all agree that leaking details of the cluster to the clients is very
likely not what's wanted in most cases, especially node names. However,
a plain numCopies is far too inflexible. There are a number of suggestions
in this area and they will be implemented in the future. However, there
are genuine reasons why a client may want to control precisely which
nodes get mirrors on them, or rather, why Rabbit may not be best placed
to decide which nodes.

Yes, Rabbit could start doing load monitoring and all that jazz, but for
example, one thing it's not aware of is whether it's in a virtualised
environment. It seems to me quite reasonable to want to ensure that your
mirrors will not end up on nodes that are actually on the same physical
machine, albeit different guests - if they do, then you're hardly
protecting your queues from spectacular hardware failure.
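
To make that concrete, here is a short Python sketch of the kind of decision
a client with outside knowledge could make. Everything in it is an
assumption for illustration: the function pick_mirror_nodes, the node names
and the node-to-hardware mapping are hypothetical, and Rabbit has no such
mapping available to it.

```python
def pick_mirror_nodes(node_hosts, master, copies):
    """Choose up to `copies` mirror nodes, at most one per physical host,
    and none on the same physical host as the master."""
    used_hosts = {node_hosts[master]}
    chosen = []
    for node in sorted(node_hosts):
        if len(chosen) >= copies:
            break
        host = node_hosts[node]
        if node == master or host in used_hosts:
            continue   # same physical machine as an already chosen node
        chosen.append(node)
        used_hosts.add(host)
    return chosen


# Hypothetical mapping of Rabbit nodes (guests) to physical machines.
node_hosts = {
    "rabbit@guest1": "hw-a",
    "rabbit@guest2": "hw-a",
    "rabbit@guest3": "hw-b",
    "rabbit@guest4": "hw-c",
    "rabbit@guest5": "hw-c",
}
mirrors = pick_mirror_nodes(node_hosts, master="rabbit@guest1", copies=2)
print(mirrors)
```

A bare "numCopies=2" could just as easily have picked guest2, which shares
hardware with the master in this made-up layout.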

So yes, none of us like the specification of node names. However, I'm
keen to ensure that any replacement offers, or at least does not preclude,
the same level of flexibility. Yes - all sorts of better layers of
indirection and
abstraction would make this more pleasant. But ultimately, I do want to
ensure we don't lose this ability.

FWIW, discussion about mirror specification is why this work did not
make it into 2.5.0.

Matthew