[rabbitmq-discuss] Concerning cluster failure scenario

Aaron Westendorf aaron at agoragames.com
Tue Aug 17 19:24:22 BST 2010

We experienced a very odd and disturbing outage yesterday.  I'll do my
best to explain and can fill in any missing details as needed.

We have a 4 host/node cluster of 1.7.2 rabbits.  One node serves our
translators that bridge synchronous HTTP traffic to our backend
services.  Two other nodes handle our services and one is a spare.
The translators have queues named "$host.$pid", which are bound to
the "response" exchange using routing keys of the same name.

One of the application nodes went down, apparently due to an outright
crash.  Monit caught that rabbit wasn't running and restarted it.  All
the rabbit hosts and our services saw this as a socket disconnect
without any closure method.  The only immediate fallout was some
mishandling of socket drops in our application stack.  Combing through
the logs yielded nothing; it appears Erlang crashed hard.

The really strange behavior happened at the node which serves our
translators.  Running `rabbitmqctl list_queues` made it clear that
most of the queues that should exist did not, including the ones our
translators need.  The logs from when the other node went down contain
many entries similar to the following:

=ERROR REPORT==== 16-Aug-2010::18:06:54 ===
connection <0.20615.10> (running), channel 2 - error:
            "commit failed:
[{exit,{{nodedown,rabbit@jackalope},{gen_server2,call,[<7282.218.0>,{commit,{{11,<0.20706.10>},98789}},infinity]}}}]",

Those errors propagated up our application stack, where our
translators re-connected to the broker (fresh socket).  That led to
this very common error:

=ERROR REPORT==== 16-Aug-2010::18:06:56 ===
connection <0.22741.65> (running), channel 1 - error:
{amqp_error,not_found,"no queue 'ogre.28645' in vhost '/hydra'",'queue.bind'}

We have added a delay to some of our applications so that reconnection
happens after a second or two to avoid this race condition, and will
make that change here too.  So both the logs and rabbitmqctl agreed
that the queues which should have existed for our translators did not.
I didn't see any errors from the basic.consume calls, though.
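
The delay itself is nothing clever; a sketch of what the reconnect
path amounts to (pika-based as above; the retry counts and the
redeclare-before-bind are illustrative assumptions, not our exact
code):

    import time

    import pika

    def reconnect(params, queue_name, delay=2.0, attempts=5):
        # Sleep before reconnecting so the broker has a moment to tear
        # down state tied to the dropped socket before we redeclare.
        for _ in range(attempts):
            time.sleep(delay)
            try:
                conn = pika.BlockingConnection(params)
                channel = conn.channel()
                # Redeclare and rebind rather than assume the queue
                # survived, so queue.bind can't hit not_found.
                channel.queue_declare(queue=queue_name)
                channel.queue_bind(queue=queue_name, exchange='response',
                                   routing_key=queue_name)
                return conn, channel
            except pika.exceptions.AMQPError:
                continue
        raise RuntimeError('gave up re-establishing the AMQP connection')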

Our translators have some test endpoints, and one of those is a ping
which writes directly to the response exchange to effectively test
that the translator and Rabbit are working together.  The response
exchange lives on the same node that the translator is connected to
and consuming from.
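
Roughly, the ping amounts to this (same pika-based sketch; the payload
and the polling are illustrative):

    import time

    # Publish a ping to ourselves through the "response" exchange,
    # using our own queue name as the routing key...
    channel.basic_publish(exchange='response', routing_key=queue_name,
                          body=b'ping')

    # ...and it should come back on our own queue if the binding is
    # intact.  Poll briefly since delivery is asynchronous.
    body = None
    for _ in range(10):
        method, header, body = channel.basic_get(queue=queue_name,
                                                 auto_ack=True)
        if body is not None:
            break
        time.sleep(0.1)
    assert body == b'ping'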

When we called this test endpoint, it succeeded!  Though rabbit did
not report a queue or binding, every single one of our translators was
receiving responses, though they never should have.  When one of our
services wrote back to the response exchange on another node in the
cluster, the message was dropped, as we would expect.
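
If anyone wants to poke at the same state, comparing each node's view
of the metadata is the quickest check I know of, e.g. run against each
node in turn (substitute the real node names):

    rabbitmqctl -n rabbit@<node> list_queues -p /hydra name messages
    rabbitmqctl -n rabbit@<node> list_bindings -p /hydra

That mismatch is exactly what we saw: list_queues showed the
translator queues missing even while the pings were being delivered.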


Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
