[rabbitmq-discuss] Concerning cluster failure scenario

Aaron Westendorf aaron at agoragames.com
Mon Aug 23 14:10:48 BST 2010


On Sat, Aug 21, 2010 at 9:54 AM, David Wragg <david at rabbitmq.com> wrote:

> You say that an application node went down, and queues disappeared.  Is
> it possible that those queues had been declared on that node?

The queues that disappeared were on other nodes and were not durable.
It is possible that our application stack is responsible for the
queues disappearing.  We have numerous instances of probably 100
different applications running on our cluster, and when something
dies, forensic gathering clashes with urgent repairs.

Interim work and a camping trip cloud some of the details.

>> The logs when the other node went down have many entries similar to
>> the following:
>>
>> =ERROR REPORT==== 16-Aug-2010::18:06:54 ===
>> connection <0.20615.10> (running), channel 2 - error:
>> {amqp_error,internal_error,
>>             "commit failed:
>
> My guess is that these are due to the crash itself.

Yeah, I expect so.  We see that anytime there's a disconnect within the cluster.

>> =ERROR REPORT==== 16-Aug-2010::18:06:56 ===
>> connection <0.22741.65> (running), channel 1 - error:
>> {amqp_error,not_found,"no queue 'ogre.28645' in vhost
>> '/hydra'",'queue.bind'}
>
> This is what I would expect if the queues in question resided on the
> crashed node.

It's a race condition that is very common for us.  The destination
queue was on the crashed node, so a commit failed.  The publisher
disconnects and then tries to reconnect, setting up queues and
bindings.
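
For illustration, a rough sketch of that reconnect path, assuming
py-amqplib; the host and credentials are stand-ins, and the queue and
exchange names are borrowed from the logs above rather than our real
configuration:

# Hypothetical sketch of the publisher's reconnect/redeclare path.
# Host, credentials, queue and exchange names are illustrative only.
from amqplib import client_0_8 as amqp

def reconnect():
    conn = amqp.Connection(host='rabbit1:5672', userid='guest',
                           password='guest', virtual_host='/hydra')
    chan = conn.channel()
    # Redeclare the non-durable, auto-delete queue.  If its sole
    # consumer just disconnected, the broker may still be tearing the
    # old queue down when this declare arrives...
    chan.queue_declare(queue='ogre.28645', durable=False,
                       auto_delete=True)
    # ...so this bind can race the deletion and fail with not_found.
    chan.queue_bind(queue='ogre.28645', exchange='response',
                    routing_key='ogre.28645')
    return conn, chan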

> AMQP applications often bind queues immediately after declaring them.
> But from your description of this error, it sounds like the binding
> happens quite separately from the declaration.  Is that right?

No, they're in rapid succession.  I know that our libevent integration
with py-amqplib isn't perfect with respect to synchronous vs.
asynchronous transactions, but binding errors only occur when we
quickly disconnect the one consumer of a non-durable queue, then
reconnect and redeclare the queue and the bindings.

Looking at our default behavior, I think the problem is that we pass
nowait=False when it should be True.
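
Concretely, the difference looks something like this (continuing the
hypothetical sketch above; chan is an open channel):

# With nowait=False (the default) each call blocks until the broker
# replies with bind-ok, so a not_found error raises right here,
# synchronously, inside our event loop:
chan.queue_bind(queue='ogre.28645', exchange='response',
                routing_key='ogre.28645', nowait=False)

# With nowait=True the broker sends no reply; the call returns
# immediately and any failure surfaces later as an asynchronous
# channel/connection error instead of blocking the event loop.
chan.queue_bind(queue='ogre.28645', exchange='response',
                routing_key='ogre.28645', nowait=True)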

>> So both logs and rabbitmqctl were in
>> agreement that the queues which should have existed for our
>> translators did not exist.  I didn't see any errors about the
>> queue.basic_consume calls though.
>
> That is indeed surprising.  Do you have the output of 'rabbitmqctl
> list_bindings' from this point?

I don't, but it was consistent with other rabbitmqctl output.

>> Our translators have some test endpoints, and one of those is a ping
>> which writes directly to the response exchange to effectively test
>> that the translator and Rabbit are working together.  The response
>> exchange is living on the same node which the translator is connected
>> to and consuming from.
>
> Note that exchanges do not live on a particular node, unlike queues.

I know, but in this case, there was a distinct branch of behavior
depending on which node was being published to.  It was almost as if
all rabbitmqctl output, and the configurations of other nodes, were
driven entirely from Mnesia, but that internally there was still a
reference to the queue, the binding, and the consumer.

>> When we called this test endpoint, it succeeded!  Though rabbit did
>> not report a queue or binding, every single one of our translators was
>> receiving responses though they never should have.  When one of our
>> services wrote back to the response exchange on the other node in the
>> cluster, the message was dropped as we would expect.
>
> Again, the output of 'rabbitmqctl list_bindings' would be useful, if you
> have it.
>
> One more general question: This is a fairly elaborate clustering set-up.
> While we intend clustering to work properly, and I hope we can get
> to the bottom of your problem, the non-uniform nature of the current
> RabbitMQ clustering can make it quite complicated to administer.  So I
> wonder what requirements led you to this design?

How do you mean non-uniform?

We think we've modeled our environment on best practices.  This report
is complicated because it covers our HTTP interfaces, but otherwise we
have asynchronous applications connected to one of several nodes.  We
use clustering to share our workload and give us fallback hosts.  If
one suite of apps fails to process data fast enough, the worst they'll
do is back up one of the cluster nodes and not take down our entire
infrastructure.  We can divide the workflow evenly and monitor the
hosts to ensure that CPU, RAM, and disk are within bounds.  We're
over-allocated on capacity at the moment, but expect an
order-of-magnitude growth over the next 6-9 months.

Thanks for your time!

cheers,
Aaron


-- 
Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
www.agoragames.com

