[rabbitmq-discuss] Handling network reliability problems on the publisher side

Stefan Kaes Stefan.Kaes at xing.com
Mon Dec 13 15:17:06 GMT 2010


Hi everyone,

As some of you may know, we’ve aimed at building a highly reliable messaging infrastructure around the idea of publishing messages redundantly to a number of identical RabbitMQ instances and deduplicating them on the receiver side (see http://xing.github.com/beetle/, https://github.com/xing/beetle and https://github.com/xing/perl-beetle).

The system has been in use in our production environment since April, and we were quite happy with it.

However, we recently had an incident which points to a weakness of our solution: we experienced a temporary network routing problem, which separated at least one of the rabbitmq servers from the publishing processes (mainly our web application). The separation lasted several minutes and led to parts of our web application blocking in socket writes. In the end, our administrators restarted the web app.

It turned out that we lost quite a few messages, and here’s why: the publisher blocks only after the TCP buffers have been filled completely (even though TCP_NODELAY and SO_SNDTIMEO are set on the socket).
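To illustrate the buffering effect, here is a minimal Python sketch (not taken from beetle; the function name and parameters are made up for this example). It connects to a local peer that accepts the connection but never reads, and counts how many bytes the kernel accepts before a non-blocking send would block. It models a stalled peer on loopback rather than a real routing failure, so the numbers are only indicative, but the mechanism is the same: writes succeed as long as there is buffer space, regardless of whether the other side receives anything.

```python
import socket


def bytes_accepted_before_blocking(chunk_size=4096, cap=256 * 1024 * 1024):
    """Count the bytes the kernel accepts from a sender whose peer never reads.

    `cap` is only a safety limit for this sketch so the loop always ends.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sender.connect(listener.getsockname())
    peer, _ = listener.accept()  # connection is accepted but never read from

    sender.setblocking(False)  # a blocking socket would hang at this point instead
    accepted = 0
    chunk = b"x" * chunk_size
    try:
        while accepted < cap:
            try:
                accepted += sender.send(chunk)
            except BlockingIOError:
                break  # buffers are full; only now would a publisher block
    finally:
        sender.close()
        peer.close()
        listener.close()
    return accepted


if __name__ == "__main__":
    print("bytes accepted without any reader:", bytes_accepted_before_blocking())
```

Every byte counted here is data the application considers "sent" although nobody has received it, which is how messages get lost when the process is restarted.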

It looks like no programming cleverness on our side can achieve what we want: have the publisher block as soon as a message cannot be accepted by the intended rabbitmq server. As far as we understand the AMQP protocol, no protocol-level acknowledgements are sent which the publisher could wait for (and optionally time out on) before sending the next message.

This is all fine to achieve high throughput on the publisher side, but doesn’t quite fit our use case. For some messages, we really need to make reasonably sure the message has been received by a rabbitmq server, before we continue. We could then use timeouts to detect network partitioning and buffer messages locally until they can be sent (only if none of our three redundant servers can be reached, of course).
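The behaviour described above could be sketched as follows. This is an illustration only, not beetle's implementation: `RedundantPublisher` and the `senders` callables are hypothetical, and each sender is assumed to return only once its server has confirmed receipt, raising `OSError` (which includes `TimeoutError`) otherwise. How such a confirmation would be obtained is exactly the open question of this mail.

```python
from collections import deque


class RedundantPublisher:
    """Publish to several servers; buffer locally only if none can be reached."""

    def __init__(self, senders):
        # Hypothetical callables send(message): return on confirmed receipt,
        # raise OSError (e.g. TimeoutError) if the server cannot be reached.
        self.senders = senders
        self.pending = deque()  # messages that no server has accepted yet

    def _try_send(self, message):
        """Return the number of servers that accepted the message."""
        delivered = 0
        for send in self.senders:
            try:
                send(message)
                delivered += 1
            except OSError:
                continue  # this server is unreachable; try the next one
        return delivered

    def flush(self):
        """Resend buffered messages in order; stop at the first total failure."""
        while self.pending:
            if self._try_send(self.pending[0]) == 0:
                return
            self.pending.popleft()

    def publish(self, message):
        """Return True if at least one server accepted the message now."""
        self.flush()  # older messages go first to preserve ordering
        if self.pending or self._try_send(message) == 0:
            self.pending.append(message)
            return False
        return True
```

Since receivers already deduplicate, resending a buffered message to a server that had in fact received it is harmless in this scheme.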

Turning on heartbeats can improve the situation somewhat, in that failures can be detected earlier, but since heartbeats are asynchronous in nature, they don’t fully solve our problem.
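A small sketch of heartbeat-based failure detection shows where the gap remains. The class below is hypothetical, and the rule of presuming the peer dead after two silent intervals is an assumption chosen for illustration. Anything written to the socket between the last received traffic and the moment the check fires is still accepted locally without complaint.

```python
import time


class HeartbeatMonitor:
    """Presume the peer dead after two heartbeat intervals without traffic."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval  # heartbeat interval in seconds
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = clock()

    def record_traffic(self):
        """Call whenever a heartbeat or any other data arrives from the peer."""
        self.last_seen = self.clock()

    def peer_presumed_dead(self):
        # Assumption for this sketch: two missed intervals mean failure.
        return self.clock() - self.last_seen >= 2 * self.interval
```

With a 5 second interval, a publisher could keep writing for up to 10 seconds after the route disappeared before this check reports a problem.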

It currently seems as if the only way to get what we want is to switch our publishers to use the JSON-RPC plugin for rabbitmq.

I’d be interested to hear what you think about the problem:


 *   is the analysis correct?
 *   is the JSON-RPC plugin stable enough to be used in production?
 *   maybe there’s a better solution available?


Regards,

-- stefan



Dr. Stefan Kaes
Principal System Architect

XING AG
Gaensemarkt 43, 20354 Hamburg, Germany
Tel. +49 40 419131-801, Fax +49 40 419131-11

Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 98807
Exec. Board (Vorstand): Dr. Stefan Groß-Selbeck (Vorsitzender), Ingo Chu, Burkhard Blum, Michael Otto, Dr. Helmut Becker
Chairman of the Supervisory Board (Aufsichtsratsvorsitzender): Dr. Neil Sunderland

