[rabbitmq-discuss] Erlang client RPC and dropped messages

Mon Apr 26 20:37:55 BST 2010

Hi Noah,

On Mon, Apr 26, 2010 at 01:58:24PM -0500, Noah Fontes wrote:
> 1. The connection to the remote RabbitMQ exchange is dropped (often this
> is because I accidentally let way too many messages build up and the
> node crashes, but that's a topic for another day and I'm guessing the
> new persister is going to fix this issue quite nicely); however, no-one
> is notified of the dropped connection because as far as I can tell the
> checks for this are only run when data actually goes across the connection.

Correct, though you could set heartbeat to non-zero. That should get you
more prompt notification.

> 2. A message is read from the local queue and published to the remote
> exchange (via amqp_channel:cast/3), which appears to be successful.

Be wary of cast. Cast returns as soon as the message has been added to
the writer's mailbox (and actually can be sooner...). Needless to say,
this does not suggest the message has made it out of the socket, or even
been looked at by the socket writer. In general, I tend to use
amqp_channel:call, not cast, for almost everything as it avoids millions
of messages backing up in the mailbox of the writer process.

> Relevant comments from rabbit_writer.erl are included here:
> %% So instead we lift the code from prim_inet:send/2, which is what
> %% gen_tcp:send/2 calls, do the first half here and then just process
> %% the result code in handle_message/2 as and when it arrives.
> %%
> %% This means we may end up happily sending data down a closed/broken
> %% socket, but that's ok since a) data in the buffers will be lost in
> %% any case (so qualitatively we are no worse off than if we used
> %% gen_tcp:send/2), and b) we do detect the changed socket status
> %% eventually, i.e. when we get round to handling the result code.

Indeed, so using call, not cast, pretty much gets you to this point, but
obviously no further.

> 3. After the message is "written" to the exchange, the connection is
> seen as closed, messages are sent out to listening Erlang processes, and
> a new connection is subsequently re-established by my code.
> 
> However, at this point the message that caused the connection drop to be
> noticed is permanently lost; since the connection wasn't actually active
> when it was published it can't possibly be rejected, and since no errors
> were thrown at publish-time, it appears as if the message was sent
> successfully. In our code, this results in ~50% data loss when a node
> unexpectedly goes down.

Right. What you're doing is fine, but with your approach you clearly
need to hold on to the most recent message you received in case the
connection drops and you then need to resend it. Furthermore, as you're
using cast, you could have millions of messages queued up in the
writer process's mailbox, all of which would be lost.

However, you're not acking. So, when the connection to the remote drops,
you could also drop the connection to the local broker. That broker will
requeue everything that's not been acked, and when you reconnect, you'll
find it all there. Now at that point, any ordering guarantees you had go
out the window, but that may not be a concern. Assuming you're using
basic.consume, you could set qos to prefetch of 1, which would then mean
that at most one message is buffered in the client without it being
acked, significantly limiting your exposure.
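
As a rough sketch of that consumer side (not a tested implementation):
the module name forwarder_sketch and the functions start/4, loop/3 and
give_up/2 are made up for illustration, and it assumes a reasonably
recent amqp_client, two already-open channels (Local and Remote), and
that the calling process is the consumer. It sets a prefetch of 1, acks
a delivery only after the publish call to the remote has returned ok,
and closes the local channel on any failure so the broker requeues the
unacked message.

```erlang
-module(forwarder_sketch).
-export([start/4]).

-include_lib("amqp_client/include/amqp_client.hrl").

%% Local and Remote are pids of open channels; Queue and Exchange are binaries.
start(Local, Remote, Queue, Exchange) ->
    %% Allow at most one unacked delivery to be outstanding on this channel.
    #'basic.qos_ok'{} =
        amqp_channel:call(Local, #'basic.qos'{prefetch_count = 1}),
    %% no_ack defaults to false, so the broker waits for our basic.ack.
    #'basic.consume_ok'{} =
        amqp_channel:subscribe(Local, #'basic.consume'{queue = Queue}, self()),
    loop(Local, Remote, Exchange).

loop(Local, Remote, Exchange) ->
    receive
        #'basic.consume_ok'{} ->
            loop(Local, Remote, Exchange);
        #'basic.cancel_ok'{} ->
            ok;
        {#'basic.deliver'{delivery_tag = Tag, routing_key = Key},
         #amqp_msg{} = Msg} ->
            Publish = #'basic.publish'{exchange = Exchange,
                                       routing_key = Key},
            %% call, not cast: returns only once the channel process has
            %% taken the message, and fails if that process is gone.
            try amqp_channel:call(Remote, Publish, Msg) of
                ok ->
                    amqp_channel:cast(Local,
                                      #'basic.ack'{delivery_tag = Tag}),
                    loop(Local, Remote, Exchange);
                Other ->
                    give_up(Local, Other)
            catch
                _:Reason ->
                    give_up(Local, Reason)
            end
    end.

%% Closing the local channel without acking makes the local broker
%% requeue the delivery we were holding.
give_up(Local, Reason) ->
    _ = (catch amqp_channel:close(Local)),
    {error, Reason}.
```

Note that an ok from call still only means the message reached the
client's writer, so this narrows the window for loss rather than
closing it.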

Finally, if you want to be really sure, use transactions to the
destination - that way you know you have to hang on to everything you've
published up until you get the commit_ok back, and then you are
guaranteed that it's been received.
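
A minimal sketch of the transactional variant, under the same caveats:
tx_publish_sketch, enable_tx/1 and publish_batch/4 are hypothetical
names, Channel is assumed to be an open channel to the destination, and
the caller is assumed to keep Payloads until publish_batch/4 returns ok.

```erlang
-module(tx_publish_sketch).
-export([enable_tx/1, publish_batch/4]).

-include_lib("amqp_client/include/amqp_client.hrl").

%% Put the channel into transactional mode; call once after opening it.
enable_tx(Channel) ->
    #'tx.select_ok'{} = amqp_channel:call(Channel, #'tx.select'{}),
    ok.

%% Publish every payload, then commit. Returns ok only after the broker
%% has replied with tx.commit_ok. Any other outcome, including the
%% channel dying mid-call, comes back as {error, _}, and the caller
%% should treat the whole batch as unsent.
publish_batch(Channel, Exchange, RoutingKey, Payloads) ->
    Publish = #'basic.publish'{exchange = Exchange,
                               routing_key = RoutingKey},
    try
        lists:foreach(
          fun(Payload) ->
                  ok = amqp_channel:call(Channel, Publish,
                                         #amqp_msg{payload = Payload})
          end, Payloads),
        #'tx.commit_ok'{} = amqp_channel:call(Channel, #'tx.commit'{}),
        ok
    catch
        Class:Reason ->
            {error, {Class, Reason}}
    end.
```

On {error, _} you would reconnect, call enable_tx/1 on the new channel,
and publish the same batch again. The destination may then see
duplicates if the commit had in fact gone through before the failure
was noticed.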

And in an obvious plug, our shovel is capable of dealing with these
issues. ;)

Matthew