[rabbitmq-discuss] Cluster recovery due to network outages

Alexandru Scvortov alexandru at rabbitmq.com
Wed Aug 4 14:45:40 BST 2010


Hi Aaron,

> =ERROR REPORT==== 1-Aug-2010::06:13:24 ===
> ** Node rabbit@caerbannog not responding **
> ** Removing (timedout) connection **
> 
> =INFO REPORT==== 1-Aug-2010::06:13:24 ===
> node rabbit@caerbannog down

As the error message suggests, mnesia timed out its connection to
another node.
 
There was a discussion about this a while ago:
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2010-March/006508.html

If you're expecting frequent short outages, you might consider
tweaking the timeout parameters as described in that thread.
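
One knob in this area is the Erlang kernel's net_ticktime, which
controls how long the VM waits before it declares another node
unreachable (the default is 60 seconds).  A rough sketch, assuming
your installation picks up an Erlang-term config file; the 120 second
value below is only illustrative:

  %% rabbitmq.config: raise the distribution tick time so that
  %% brief network blips don't get nodes declared down
  [
    {kernel, [{net_ticktime, 120}]}
  ].

If a config file isn't an option on your version, the equivalent
-kernel net_ticktime 120 can be passed to the VM in the server start
arguments instead.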

> A short time later the hosts recovered, also as we've seen before:
> 
> =INFO REPORT==== 1-Aug-2010::06:26:48 ===
> node rabbit@caerbannog up
> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
> Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
> {inconsistent_database, running_partitioned_network, rabbit@caerbannog}
> 
> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
> Mnesia(rabbit@bigwig): ** ERROR ** mnesia_event got
> {inconsistent_database, starting_partitioned_network, rabbit@caerbannog}
> 

During the outage, the nodes were out of contact with each other for
so long that mnesia became worried about possible inconsistencies.
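
Incidentally, if you'd rather catch these partitions programmatically
than by watching the logs, a process on the node can subscribe to
mnesia's system events.  A rough sketch from an Erlang shell attached
to the broker node (the probe node name and the timeout are just
illustrative):

  %% e.g.  erl -sname probe -remsh rabbit@bigwig   (cookies must match)
  mnesia:subscribe(system),
  receive
      {mnesia_system_event, {inconsistent_database, Context, Node}} ->
          io:format("partitioned from ~p (~p)~n", [Node, Context])
  after 300000 ->
      no_partition_seen
  end.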

The simplest solution would be to take down 3 of the nodes and
restart them.  This should allow them to sync with the fourth.
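
Concretely, on each of the three nodes to be resynced (leaving the
fourth, say rabbit@bigwig, alone), something along these lines should
do; adjust for whatever init scripts you use:

  rabbitmqctl stop            # shuts the whole node down cleanly
  rabbitmq-server -detached   # starts it again; it should rejoin and resync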

There's a longer explanation available here:

http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id2277661

> Another 15 minutes later and the timedout errors were logged and that
> was the end of the cluster; I think two of the nodes figured out how
> to connect back to each other, but two others remained on their own.
> The hosts and nodes themselves never shutdown, and when I restarted
> just one of the nodes later in the day, the whole cluster rediscovered
> itself and all appeared to be well (`rabbitmqctl status` was
> consistent with expectations).
> 
> So our first problem is that the nodes did not re-cluster after the
> second outage.

If this was caused by the inconsistent_database errors, there's not
much you can do apart from restarting some of the nodes.

> Once we corrected the cluster though, our applications
> still did not respond and we had to restart all of our clients.
> 
> Our clients all have a lot of handling for connection drops and
> channel closures, but most of them did not see any TCP disconnects to
> their respective nodes.  When the cluster was fixed, we found a lot of
> our queues missing (they weren't durable), and so we had to restart
> all of the apps to redeclare the queues.  This still didn't fix our
> installation though, as our apps were receiving and processing data,
> but responses were not being sent back out of our HTTP translators.
> 
> We have a single exchange, "response" that any application expecting a
> response can bind to.  Our HTTP translators handle traffic from our
> public endpoints, publish to various exchanges for the services we
> offer, and those services in turn write back to the response exchange.
>  We have a monitoring tool that confirmed that these translators could
> write a response to its own Rabbit host and immediately receive it (a
> ping, more or less).  However, none of the responses from services
> which were connected to other Rabbit nodes were received by the
> translators.
> 
> In short, it appeared that even though the cluster was healed and all
> our services had re-declared their queues, the bindings between the
> response exchange and the queues which our translators use did not
> appear to propagate to the rest of the nodes in the cluster.

That doesn't sound right.  As you say, if the cluster was indeed
running, the queues/exchanges/bindings should have appeared on all of
the nodes.

It's possible that the rabbit nodes reconnected successfully, but the
mnesia connections didn't.  When a rabbitmq node detects that another
node has gone down, it automatically removes the queues declared on
that node from the cluster.  If the rabbit nodes think everything is
fine, this removal wouldn't happen.  As a result, rabbitmqctl might
report queues/exchanges/bindings that are actually unusable.
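
One way to check for that state is to run the same queries on every
node and compare the answers; if mnesia really is back in sync, each
node should report the same thing (the exact output format varies by
version):

  rabbitmqctl status          # lists the nodes this node believes are clustered
  rabbitmqctl list_bindings   # should be identical on every node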

> So in summary,
> 
> * Rabbit didn't re-connect to the other nodes after the second TCP disconnect

We don't have any logic in the broker to recover from
inconsistent_database errors.  Your best bet is probably to restart
all but one of the nodes.

> * After fixing the cluster (manually or automatically), Rabbit appears
> to have lost its non-durable queues even though the nodes never
> stopped
> * Although we had every indication that exchanges and queues were
> still alive and functional, bindings appear to have been lost between
> Rabbit nodes

See above.  The cluster may not have been completely repaired.  Try
restarting.

> What we'd like to know is,
> 
> * Does any of this make sense and can we add more detail to help fix any bugs?

It makes some sense.  Thanks for pointing this problem out.

> * Have there been fixes for these issues since 1.7.2 that we should deploy?

Not for these issues, sorry.

> * Is there anything we should add/change about our applications to
> deal with these types of situations?

I'm not sure what you could do to prevent this.  This is more of a
mnesia problem.

Cheers,
Alex

