[rabbitmq-discuss] lost message due to binding delay

Wed May 26 20:16:16 BST 2010

Simon,

Thank you for the reply.  Responses inline.

On Wed, May 26, 2010 at 11:42 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
>
> I'm not completely sure I understand but if you're saying that your
> driver pipelines synchronous methods (e.g. sends queue.bind in the
> dialogue below before receiving queue.declare-ok from the broker) then
> that's very bad. We don't attempt to detect this in the broker (as any
> such detection would be very racy) but it violates the spec and could
> cause any sort of weird behaviour. So that's possibility #1.

Your understanding is correct. A year ago we implemented a pending
queue for basic.consume after running into problems when multiple
consume requests were outstanding.  From what you're saying, that
approach needs to be extended to every method with a corresponding
"_ok" reply, correct?  Is there another rule for which methods are
synchronous and which are not?  For instance, we wrote
basic.publish_synchronous to abstract the basic.publish and tx.commit
behavior, and that also needed a pending queue of basic.publish calls
that is drained as we receive tx.commit_ok replies, even though
basic.publish itself is not a synchronous call.
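
To make the idea concrete, here is a rough sketch of the kind of pending
queue I mean, generalized to any method that expects a reply.  This is not
our driver's actual code: the Channel class, send(), on_method() and the
write_frame callable are all hypothetical names used only for illustration.
Methods without a reply (such as basic.publish) go through the same queue
so that they cannot overtake a synchronous method that is still buffered.

```python
from collections import deque


class Channel:
    """Hypothetical evented channel that never pipelines synchronous methods."""

    def __init__(self, write_frame):
        # write_frame(method, args) is assumed to put one method on the wire.
        self._write_frame = write_frame
        self._pending = deque()   # methods accepted from the app but not yet sent
        self._waiting = None      # (reply name, callback) for the in-flight sync method

    def send(self, method, args=None, reply=None, on_ok=None):
        """Queue a method. Pass reply="...-ok" for synchronous methods."""
        self._pending.append((method, args, reply, on_ok))
        self._pump()

    def _pump(self):
        # Send queued methods in order, stopping after each synchronous one
        # until its reply has been received.
        while self._pending and self._waiting is None:
            method, args, reply, on_ok = self._pending.popleft()
            self._write_frame(method, args)
            if reply is not None:
                self._waiting = (reply, on_ok)

    def on_method(self, name, args=None):
        """Called by the IO loop for every method received from the broker."""
        if self._waiting is not None and self._waiting[0] == name:
            _, on_ok = self._waiting
            self._waiting = None
            if on_ok is not None:
                on_ok(args)
            self._pump()


sent = []
ch = Channel(lambda method, args: sent.append(method))
ch.send("queue.declare", reply="queue.declare-ok")
ch.send("queue.bind", reply="queue.bind-ok")
ch.send("basic.publish")
# Only queue.declare has been written so far; the rest is buffered.
ch.on_method("queue.declare-ok")   # queue.bind is written now
ch.on_method("queue.bind-ok")      # basic.publish is written now
```

The application can still issue calls without blocking; only the bytes on
the wire are serialized, so the evented model is preserved.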

The driver's speed and low resource requirements come mostly from
evented IO and timers, so we can implement queuing without much
overall impact.  It seems that much of the spec is written with the
expectation of synchronous request-response, which makes an
asynchronous driver a bit challenging at times.  The asynchronous
design does allow our highest-trafficked applications, the protocol
bridges (HTTP - Rabbit), to have very high throughput with low
overhead.  To date, the buffering of data hasn't adversely impacted
memory usage because Rabbit kindly works fast enough to not keep data
buffered in our clients for long.

> I note you don't talk about channels here. In AMQP the only real
> ordering guarantees are within channels. So if each service is using
> more than one channel in the dialogue above then that's possibility #2.

In the context of the transaction taking place, we're using a single
channel.  We can make use of multiple channels, but not in this case,
and we're always using a single channel for a transaction (in the app
sense, not tx.* sense).

> We have recently found and fixed a bug which could cause messages
> (internal to Rabbit, so they could be basic.publish, tx.commit or
> various other things) to overtake each other in certain (we thought)
> theoretical circumstances. So this could be possibility #3.
>
> *However*, queue.bind is not one of the messages that could be
> overtaken. And I'm afraid binding to a queue is in fact a synchronous
> operation anyway. So the explanation as presented can't be quite right.
> And I can't immediately see any other way for something to get overtaken
> and cause the results you're seeing, although it would help to see the
> dialogue expanded to show the bit where serviceB replies.

ServiceB would call basic.publish and tx.commit only, having already
set up its outbound channel.

> But you might want to try compiling from hg (default branch) and see if
> that fixes your problem.

Will give it a try, but our main cluster is now in production so we
will have to spin up our test hosts and build an isolated test case.

Please let us know what you think of the synchronous vs. pipelining
question, i.e. how to know when we should buffer a request.

cheers,
Aaron

-- 
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
www.agoragames.com