[rabbitmq-discuss] lost message due to binding delay

Aaron Westendorf aaron at agoragames.com
Fri May 21 20:57:57 BST 2010


We tracked down an interesting bug today in a 1.7.2 cluster.  Our
setup is as follows:

cluster: 4 hosts, 1 node each  ("rabbit1", "rabbit2", "rabbit3", "rabbit4")
clients: 2 services, "serviceA" and "serviceB"; 1 or more processes each

In this situation both services are connected to rabbit1.  Both
services have a standard queue and binding setup that is built into
our application stack so that they can receive messages from our HTTP
bridge, and also receive messages that we send between services.  The
queues, bindings and consumers are all declared when the services
start, and the queues are named by service so that messages are
distributed across all instances of each service.
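
For reference, the startup setup looks roughly like this (a sketch
against plain py-amqplib rather than our driver; the 'services'
exchange name and the callback are illustrative, not our real names):

from amqplib import client_0_8 as amqp

def setup_service_queue(service_name, on_message, host='rabbit1:5672'):
    # One shared queue per service, consumed by every process of that
    # service, so Rabbit distributes messages across all instances.
    conn = amqp.Connection(host=host, userid='guest', password='guest')
    chan = conn.channel()
    chan.exchange_declare('services', 'topic', durable=True,
                          auto_delete=False)
    chan.queue_declare(queue=service_name, durable=True, exclusive=False,
                       auto_delete=False)
    chan.queue_bind(queue=service_name, exchange='services',
                    routing_key=service_name)
    chan.basic_consume(queue=service_name, callback=on_message)
    return conn, chan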

The bug occurs when we use the part of our stack that allows serviceA
to query serviceB and receive a response.  To be sure that the
response ends up with the right process, each process sets up a queue
that resolves to its host and pid.  The queue and its bindings are
not declared at startup but instead on demand, the first time the
services interact.

Our python driver is a fork of py-amqplib which uses libevent for all
IO and scheduling.  The driver has been in use for a while now,
though it needs a lot of documentation before it is ready to be
released into the wild (we promise, we're working on it).  What this
means is that if we send multiple AMQP messages during the same event
loop cycle, their bytes sit in a local buffer until the current cycle
completes.  When a write event is processed, we push as many bytes
into the socket buffer as possible, and in this case all of them
likely fit in a single write.
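
In rough terms, the write path behaves like the following sketch (not
the actual driver code; the class and method names are made up):

class BufferedWriter(object):
    def __init__(self, sock):
        self.sock = sock
        self.buf = b''

    def send_frame(self, frame_bytes):
        # Called once per outgoing AMQP frame during an event loop
        # cycle.  Nothing touches the wire yet; the bytes just
        # accumulate in the local buffer.
        self.buf += frame_bytes

    def on_writable(self):
        # libevent write callback: push as much as the socket will
        # accept.  In the scenario above, the entire buffer typically
        # fits in one send().
        sent = self.sock.send(self.buf)
        self.buf = self.buf[sent:]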

So, the first time a process performs inter-service communication,
the bytes that set up its response subscription are immediately
followed by the message sent to the second service.  For example:

serviceA: exchange_declare('response', 'topic')     # already exists
serviceA: queue_declare('serviceA.response.host.pid')
serviceA: queue_bind('serviceA.response.host.pid', 'response',
                     routing_key='serviceA.response.host.pid')
serviceA: basic_consume( 'serviceA.response.host.pid' )
serviceA: tx_select()
serviceA: basic_publish( <a message to serviceB> )
serviceA: tx_commit()
serviceB: on receipt, do a DB query and send response back to serviceA
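
In plain py-amqplib terms, the serviceA side of that exchange looks
roughly like this (a sketch, not our driver code; the 'services'
exchange and 'serviceB' routing key used for the outbound query are
illustrative):

import os
import socket
from amqplib import client_0_8 as amqp

def on_response(msg):
    print('response: ' + msg.body)

conn = amqp.Connection(host='rabbit1:5672', userid='guest',
                       password='guest')
chan = conn.channel()

reply_queue = 'serviceA.response.%s.%d' % (socket.gethostname(),
                                           os.getpid())

chan.exchange_declare('response', 'topic', durable=True,
                      auto_delete=False)   # already exists
chan.queue_declare(queue=reply_queue, auto_delete=True)
chan.queue_bind(queue=reply_queue, exchange='response',
                routing_key=reply_queue)
chan.basic_consume(queue=reply_queue, callback=on_response)

chan.tx_select()
chan.basic_publish(amqp.Message('query for serviceB',
                                reply_to=reply_queue),
                   exchange='services', routing_key='serviceB')
chan.tx_commit()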

Using our passive listener application, we can confirm that serviceB
writes a response with the correct routing keys, but serviceA never
receives it.  Subsequent messages skip everything before
basic_publish() and work as expected.

We are unable to reproduce this bug if we're running a single Rabbit
node.  The reason I suspect a problem with the exchange-to-queue
binding is that all of the other messages are flowing, which means
that Rabbit is handling the queue_* and tx_* methods in the order in
which we expect them to be processed.  Because we're running
this in a cluster, it is necessary for all nodes to register the
binding of the exchange to the queue.  I suspect that this is an
asynchronous operation, and that "rabbit1" has not confirmed that the
binding is in place by the time serviceB writes its response.  We
don't have exact timings, but the round-trip time for the request and
response is between 1 and 5ms.

For now, we're going to change our services so that they set up
their response queues and bindings when they start.  This will greatly
increase the number of queues that we have live, as our services are
all distributed and multi-process, but very few of them use
inter-service messaging.
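
In code terms that just means moving the declare/bind/consume calls
out of the request path and into each process's startup routine,
roughly like this (same caveats as the sketches above):

import os
import socket

def on_service_start(chan, service_name, on_response):
    # Pre-declare the per-process response queue and its binding, so
    # the binding has propagated to every node long before the first
    # inter-service request is made.
    reply_queue = '%s.response.%s.%d' % (service_name,
                                         socket.gethostname(),
                                         os.getpid())
    chan.queue_declare(queue=reply_queue, auto_delete=True)
    chan.queue_bind(queue=reply_queue, exchange='response',
                    routing_key=reply_queue)
    chan.basic_consume(queue=reply_queue, callback=on_response)
    return reply_queue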

-Aaron


-- 
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
www.agoragames.com


