[rabbitmq-discuss] lost message due to binding delay

Wed May 26 16:42:18 BST 2010

Hi Aaron. Interesting problem! There are a few possibilities for what 
could be happening; see below.

On 21/05/10 20:57, Aaron Westendorf wrote:
> We tracked down an interesting bug today in a 1.7.2 cluster.  Our
> setup is as follows:
>
> cluster: 4 hosts, 1 node each  ("rabbit1", "rabbit2", "rabbit3", "rabbit4")
> clients: 2 services, "serviceA" and "serviceB"; 1 or more processes each
>
> In this situation both services are connected to rabbit1.  Both
> services have a standard queue and binding setup that is built into
> our application stack so that they can receive messages from our HTTP
> bridge, and also receive messages that we send between services.  The
> queues, bindings and consumers are all declared when the services
> start, and the queues are named by service name to distribute messages
> between all instances of each service.
>
> The bug occurs when we use the part of our stack that allows serviceA
> to query serviceB and receive a response.  To be sure that the
> response ends up with the right process, each process sets up a queue
> that resolves to its host and pid. The queue and its bindings are not
> allocated at startup but instead on-demand when services interact.
>
> Our python driver is a fork of py-amqplib which uses libevent for all
> IO and scheduling.  The driver has been in use for a while now, though
> it needs a lot of documentation before it is ready to be released into
> the wild (we promise, we're working on it).  What this means is that
> if we have multiple AMQP messages sent during the same event loop
> cycle, the bytes reside locally in a buffer until the current cycle
> completes.  When a write event is processed, we push as many bytes
> into the socket buffer as possible, and in this case, likely all of
> the bytes would be able to fit into the socket buffer.

I'm not completely sure I understand but if you're saying that your 
driver pipelines synchronous methods (e.g. sends queue.bind in the 
dialogue below before receiving queue.declare-ok from the broker) then 
that's very bad. We don't attempt to detect this in the broker (as any 
such detection would be very racy) but it violates the spec and could 
cause any sort of weird behaviour. So that's possibility #1.
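
To make possibility #1 concrete, here is a minimal sketch of a 
client-side guard that never has more than one synchronous method in 
flight on a channel. The SyncGate class and its send/on_reply methods 
are hypothetical names for illustration, not part of py-amqplib or of 
your driver; only the AMQP method names are real. The sketch also 
assumes it is simplest to hold back everything (including 
basic.publish) while a reply is outstanding, so that ordering is 
preserved.

```python
from collections import deque

# Subset of AMQP synchronous methods and the reply each one waits for.
# basic.publish is asynchronous, so it has no entry here.
SYNC_REPLIES = {
    "exchange.declare": "exchange.declare-ok",
    "queue.declare": "queue.declare-ok",
    "queue.bind": "queue.bind-ok",
    "basic.consume": "basic.consume-ok",
    "tx.select": "tx.select-ok",
    "tx.commit": "tx.commit-ok",
}


class SyncGate:
    """Hypothetical per-channel gate: one synchronous method in flight."""

    def __init__(self, write):
        self._write = write      # callable that puts a method on the wire
        self._pending = deque()  # methods held back while a reply is due
        self._awaiting = None    # name of the reply we are waiting for

    def send(self, method):
        if self._awaiting is not None:
            # A synchronous method is outstanding: queue, don't pipeline.
            self._pending.append(method)
            return
        self._write(method)
        # Asynchronous methods map to None, so the gate stays open.
        self._awaiting = SYNC_REPLIES.get(method)

    def on_reply(self, reply):
        if reply != self._awaiting:
            raise RuntimeError("unexpected reply %r" % (reply,))
        self._awaiting = None
        # Flush held-back methods until the next synchronous one blocks.
        while self._pending and self._awaiting is None:
            self.send(self._pending.popleft())
```

With something like this in the write path, queue.bind would only 
reach the socket after queue.declare-ok has been read, however many 
methods were issued during one event loop cycle.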

> So, when we perform inter-service communication, the first time it
> occurs in that process, the bytes for setting up our subscription are
> immediately followed by the message sent to the second service.  For
> example:
>
> serviceA: exchange_declare('response', 'topic')     # already exists
> serviceA: queue_declare('serviceA.response.host.pid')
> serviceA: queue_bind('serviceA.response.host.pid', 'response',
> routing_key='serviceA.response.host.pid')
> serviceA: basic_consume( 'serviceA.response.host.pid' )
> serviceA: tx_select()
> serviceA: basic_publish(<a message to serviceB>  )
> serviceA: tx_commit()
> serviceB: on receipt, do a DB query and send response back to serviceA
>
> Using our passive listener application, we can confirm that serviceB
> writes a response with the correct routing keys, but serviceA never
> receives it.  Subsequent messages skip everything before
> basic_publish() and work as expected.

I note you don't talk about channels here. In AMQP the only real 
ordering guarantees are within channels. So if each service is using 
more than one channel in the dialogue above then that's possibility #2.
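
As a rough illustration of possibility #2, the sketch below enumerates 
the orders in which a broker could process the dialogue if it were 
split across two channels, assuming only per-channel order is 
preserved. The split into a "setup" and a "publish" channel is purely 
hypothetical; I don't know how your driver actually assigns channels.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Hypothetical split of the dialogue across two channels.
setup_channel = ["queue.declare", "queue.bind", "basic.consume"]
publish_channel = ["tx.select", "basic.publish", "tx.commit"]

orders = list(interleavings(setup_channel, publish_channel))

# Orders in which the request is committed before the binding exists,
# so a fast reply from serviceB could find nothing to route to.
unsafe = [o for o in orders if o.index("tx.commit") < o.index("queue.bind")]

for order in unsafe:
    print(" -> ".join(order))
```

Every order it prints is permitted when the methods travel on 
different channels; on a single channel only the order you wrote down 
is possible.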

> We are unable to reproduce this bug if we're running a single Rabbit
> node.  The reason I suspect that it's a problem with the
> exchange-queue binding is that all of the messages are flowing, which
> means that Rabbit is handling the queue_* and tx_* methods in the
> order in which we expect them to be processed.  Because we're running
> this in a cluster, it is necessary for all nodes to register the
> binding of the exchange to the queue.  I suspect that this is an
> asynchronous operation, and that "rabbit1" has not confirmed that the
> binding is in place by the time serviceB writes its response.  We
> don't have exact timings, but the round-trip time for the request and
> response is between 1 and 5ms.

We have recently found and fixed a bug which could cause messages 
(internal to Rabbit, so they could be basic.publish, tx.commit or 
various other things) to overtake each other in certain circumstances 
that we had thought were only theoretical. So this could be 
possibility #3.

*However*, queue.bind is not one of the messages that could be 
overtaken. And I'm afraid binding to a queue is in fact a synchronous 
operation anyway. So the explanation as presented can't be quite right. 
And I can't immediately see any other way for something to get overtaken 
and cause the results you're seeing, although it would help to see the 
dialogue expanded to show the bit where serviceB replies.

But you might want to try compiling from hg (default branch) and see if 
that fixes your problem.

Cheers, Simon