[rabbitmq-discuss] Concerning cluster failure scenario

Sat Aug 21 14:54:18 BST 2010

Hi Aaron,

Some questions below.

Aaron Westendorf <aaron at agoragames.com> writes:
> We experienced a very odd and disturbing outage yesterday.  I'll do my
> best to explain and can fill in any missing details as needed.
>
> We have a 4 host/node cluster of 1.7.2 rabbits.  One node serves our
> translators that bridge synchronous HTTP traffic to our backend
> services.  Two other nodes handle our services and one is a spare.
> The translators have queues named "$host.$pid", which are bound to
> the "response" exchange using routing keys of the same name.
>
> One of the application nodes went down, apparently due to an outright
> crash.  Monit caught that rabbit wasn't running and restarted it.  All
> the rabbit hosts and our services saw this as a socket disconnect
> without any closure method.  The only immediate fallout from this was
> a mis-handling in our application stack for socket drops.  Combing
> through the logs yielded nothing; it appears Erlang crashed hard.
>
> The really strange behavior happened at the node which serves our
> translators.  Running `rabbitmqctl list_queues` it was clear that most
> of the queues that should exist did not, including the ones which our
> translators need.

You say that an application node went down, and queues disappeared.  Is
it possible that those queues had been declared on that node?  As our
clustering guide at <http://www.rabbitmq.com/clustering.html> mentions,
queues reside on a single node - the one where they were declared.  So how
did your translator queues get declared, and were they declared as
durable?
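
To make the point about queue locality concrete, here is a toy model in
Python.  It is only a sketch of the behaviour described in the clustering
guide, not RabbitMQ code: the ToyCluster class and the node names are
made up for illustration, and it assumes that a durable queue comes back
when its home node restarts while a non-durable one is lost.

```python
class ToyCluster:
    """Toy model: every queue lives on exactly one node, the one where it was declared."""

    def __init__(self, nodes):
        self.up = set(nodes)
        self.queues = {}  # queue name -> (home node, durable flag)

    def declare_queue(self, node, name, durable=False):
        if node not in self.up:
            raise RuntimeError(f"node {node} is down")
        # Declaring an existing queue again leaves it on its original home node.
        self.queues.setdefault(name, (node, durable))

    def crash(self, node):
        self.up.discard(node)
        # Assumption of this model: non-durable queues homed on the crashed
        # node are lost, durable ones survive until the node restarts.
        self.queues = {
            name: (home, durable)
            for name, (home, durable) in self.queues.items()
            if home != node or durable
        }

    def restart(self, node):
        self.up.add(node)

    def list_queues(self):
        # Only queues whose home node is currently running are reported.
        return sorted(
            name for name, (home, _) in self.queues.items() if home in self.up
        )


# Hypothetical node names; the queue names follow the "$host.$pid" pattern.
cluster = ToyCluster(["rabbit@translators", "rabbit@services"])
cluster.declare_queue("rabbit@services", "ogre.28645")
cluster.declare_queue("rabbit@services", "ogre.28646", durable=True)
cluster.crash("rabbit@services")
cluster.restart("rabbit@services")
print(cluster.list_queues())  # ['ogre.28646']
```

In this model a translator connected to a surviving node still loses its
queue if the queue happened to be declared on the node that crashed.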

> The logs when the other node went down have many entries similar to
> the following:
>
> =ERROR REPORT==== 16-Aug-2010::18:06:54 ===
> connection <0.20615.10> (running), channel 2 - error:
> {amqp_error,internal_error,
>             "commit failed:
> [{exit,{{nodedown,rabbit@jackalope},{gen_server2,call,[<7282.218.0>,{commit,{{11,<0.20706.10>},98789}},infinity]}}}]",
>             'tx.commit'}

My guess is that these are due to the crash itself.

> Those errors propagated up our application stack where our translators
> re-connected to the broker (fresh socket).  That led to this error
> which is very common:
>
> =ERROR REPORT==== 16-Aug-2010::18:06:56 ===
> connection <0.22741.65> (running), channel 1 - error:
> {amqp_error,not_found,"no queue 'ogre.28645' in vhost
> '/hydra'",'queue.bind'}

This is what I would expect if the queues in question resided on the
crashed node.

AMQP applications often bind queues immediately after declaring them.
But from your description of this error, it sounds like the binding
happens quite separately from the declaration.  Is that right?
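
For comparison, the declare-then-bind pattern looks roughly like the
sketch below.  I don't know which client library you use, so this assumes
a channel object with queue_declare and queue_bind methods shaped like
those of the Python pika client; declare_and_bind and RecordingChannel
are hypothetical names, and the latter is just a stand-in so that the
example runs without a broker.

```python
def declare_and_bind(channel, queue_name, exchange="response"):
    """Declare a queue and bind it straight away, on the same channel."""
    # queue.declare creates the queue if it is missing and otherwise just
    # checks it, so repeating it after every reconnect is safe.
    channel.queue_declare(queue=queue_name)
    # The translators use the queue name as the routing key.
    channel.queue_bind(queue=queue_name, exchange=exchange, routing_key=queue_name)


class RecordingChannel:
    """Stand-in for a real channel: records the calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def queue_declare(self, queue):
        self.calls.append(("queue.declare", queue))

    def queue_bind(self, queue, exchange, routing_key):
        self.calls.append(("queue.bind", queue, exchange, routing_key))


channel = RecordingChannel()
declare_and_bind(channel, "ogre.28645")
for call in channel.calls:
    print(call)
```

With this ordering, a queue.bind after a reconnect cannot fail with
not_found merely because the queue went away with a crashed node.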

> We have added a delay to some of our applications so that reconnection
> happens after a second or two to avoid this race condition, and will
> make that change here too.  So both logs and rabbitmqctl were in
> agreement that the queues which should have existed for our
> translators did not exist.  I didn't see any errors about the
> queue.basic_consume calls though.

That is indeed surprising.  Do you have the output of 'rabbitmqctl
list_bindings' from this point?

> Our translators have some test endpoints, and one of those is a ping
> which writes directly to the response exchange to effectively test
> that the translator and Rabbit are working together.  The response
> exchange is living on the same node which the translator is connected
> to and consuming from.

Note that exchanges do not live on a particular node, unlike queues.

> When we called this test endpoint, it succeeded!  Though rabbit did
> not report a queue or binding, every single one of our translators was
> receiving responses though they never should have.  When one of our
> services wrote back to the response exchange on another node in the
> cluster, the message was dropped as we would expect.

Again, the output of 'rabbitmqctl list_bindings' would be useful, if you
have it.

One more general question: This is a fairly elaborate clustering set-up.
While we intend clustering to work properly, and I hope we can get
to the bottom of your problem, the non-uniform nature of the current
RabbitMQ clustering can make it quite complicated to administer.  So I
wonder what requirements led you to this design?

David

-- 
David Wragg
Staff Engineer, RabbitMQ
SpringSource, a division of VMware