[rabbitmq-discuss] Erlang client RPC and dropped messages

Mon Apr 26 19:58:24 BST 2010

Hi all,

I'm encountering an issue with publishing messages that are
mandatory/persisted (i.e., we can't risk dropping the messages). I have
a tool similar to rabbitmq-shovel (developed slightly before it). It
works like this:

1. A message is read from a queue on a local RabbitMQ instance.
2. The message is then published to a remote exchange and, if no errors
occur, acked. I attempt to re-publish the message if I get a rejected
call from the remote RabbitMQ instance. This works almost all the time,
however...

I'm also listening for connection drops on the local and remote
connections, and re-establishing them if they go down. So something like
this occurs:

1. The connection to the remote RabbitMQ exchange is dropped. (Often this
is because I accidentally let way too many messages build up and the
node crashes, but that's a topic for another day, and I'm guessing the
new persister is going to fix this issue quite nicely.) However, no-one
is notified of the dropped connection because, as far as I can tell, the
checks for this are only run when data actually goes across the connection.

2. A message is read from the local queue and published to the remote
exchange (via amqp_channel:cast/3), which appears to be successful.
Relevant comments from rabbit_writer.erl are included here:

%% So instead we lift the code from prim_inet:send/2, which is what
%% gen_tcp:send/2 calls, do the first half here and then just process
%% the result code in handle_message/2 as and when it arrives.
%%
%% This means we may end up happily sending data down a closed/broken
%% socket, but that's ok since a) data in the buffers will be lost in
%% any case (so qualitatively we are no worse off than if we used
%% gen_tcp:send/2), and b) we do detect the changed socket status
%% eventually, i.e. when we get round to handling the result code.

3. After the message is "written" to the exchange, the connection is
seen as closed, messages are sent out to listening Erlang processes, and
a new connection is subsequently re-established by my code.

However, at this point the message that caused the connection drop to be
noticed is permanently lost. Since the connection wasn't actually active
when the message was published, it can't possibly be rejected, and since
no errors were thrown at publish time, it appears as if the message was
sent successfully. In our code, this results in ~50% data loss when a
node unexpectedly goes down.
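
To make the sequence above concrete, here is a minimal sketch in plain
Erlang. It does not use the RabbitMQ client at all: the writer/3 process,
the link_down message and cast/2 are hypothetical stand-ins I made up to
model a fire-and-forget publish whose failure is only reported after the
message has already been accepted. It is an illustration of the failure
mode as I understand it, not of the actual rabbit_writer code.

```erlang
-module(lost_cast_demo).
-export([run/0]).

%% Hypothetical stand-in for a channel writer. It accepts every publish
%% without checking the link first, and only reports a dead link to the
%% owner once a publish has already been attempted on it.
writer(Owner, LinkUp, Delivered) ->
    receive
        {publish, Msg} when LinkUp ->
            writer(Owner, LinkUp, [Msg | Delivered]);
        {publish, _Msg} ->
            %% The link is already down: the message is discarded and
            %% the owner learns about the closure only now.
            Owner ! {connection_closed, self()},
            writer(Owner, LinkUp, Delivered);
        link_down ->
            %% Silent drop: nobody is notified at this point.
            writer(Owner, false, Delivered);
        {get_delivered, From} ->
            From ! {delivered, lists:reverse(Delivered)},
            writer(Owner, LinkUp, Delivered)
    end.

%% Fire-and-forget publish (hypothetical): always returns ok.
cast(Writer, Msg) ->
    Writer ! {publish, Msg},
    ok.

run() ->
    Self = self(),
    Writer = spawn(fun() -> writer(Self, true, []) end),
    ok = cast(Writer, <<"m1">>),   % link is up, message is delivered
    Writer ! link_down,            % link drops without notification
    ok = cast(Writer, <<"m2">>),   % still returns ok, message is lost
    Closed = receive
                 {connection_closed, Writer} -> true
             after 1000 -> false
             end,
    Writer ! {get_delivered, self()},
    Delivered = receive
                    {delivered, D} -> D
                after 1000 -> timeout
                end,
    exit(Writer, kill),
    {Closed, Delivered}.
```

Calling lost_cast_demo:run() shows both casts returning ok while only the
first message is recorded as delivered; the caller hears about the closed
connection only after the second message is already gone.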

I'm open to suggestions on a better way to architect my code to get
around this issue, but it seems (to me, and I must confess my general
ignorance of the implementation) like a bug, or at least something that
should be addressed or cautioned about.

Regards and thanks in advance for any advice!

Noah