[rabbitmq-discuss] Cluster recovery due to network outages

Aaron Westendorf aaron at agoragames.com
Wed Aug 4 14:52:13 BST 2010


Thank you for your reply.  We're building up our Erlang, Rabbit and
Mnesia knowledge, and I'll pass along your reply to the rest of the
team.


On Wed, Aug 4, 2010 at 9:45 AM, Alexandru Scvortov
<alexandru at rabbitmq.com> wrote:
> Hi Aaron,
>> =ERROR REPORT==== 1-Aug-2010::06:13:24 ===
>> ** Node rabbit at caerbannog not responding **
>> ** Removing (timedout) connection **
>> =INFO REPORT==== 1-Aug-2010::06:13:24 ===
>> node rabbit at caerbannog down
> As the error message suggests, it means mnesia timed out a connection to
> another node.
> There was a discussion about this a while ago
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2010-March/006508.html
> If you're expecting frequent short outages, you might consider
> tweaking the timeout parameters as described above.
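
The parameter in question is likely the Erlang kernel's `net_ticktime`
(my assumption; the linked thread has the details), which controls how
long nodes wait before declaring each other unreachable.  A sketch of
raising it from the default 60 seconds via the server's environment:

```shell
# Assumption: RABBITMQ_SERVER_START_ARGS is read by the rabbitmq-server
# startup script and passed through to the Erlang VM.  A higher
# net_ticktime makes nodes tolerate longer silences before timing out
# connections to their peers.
export RABBITMQ_SERVER_START_ARGS="-kernel net_ticktime 120"
```

Every node in the cluster should use the same value, or the nodes will
disagree about when a peer is down.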
>> A short time later the hosts recovered, also as we've seen before:
>> =INFO REPORT==== 1-Aug-2010::06:26:48 ===
>> node rabbit at caerbannog up
>> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
>> Mnesia(rabbit at bigwig): ** ERROR ** mnesia_event got
>> {inconsistent_database, running_partitioned_network,
>>  rabbit at caerbannog}
>> =ERROR REPORT==== 1-Aug-2010::06:26:48 ===
>> Mnesia(rabbit at bigwig): ** ERROR ** mnesia_event got
>> {inconsistent_database, starting_partitioned_network
>> , rabbit at caerbannog}
> During the outage, the nodes were out of contact with each other for
> so long that mnesia suspected possible inconsistencies.
> The simplest solution would be to take down 3 of the nodes and
> restart them.  This should allow them to sync with the fourth.
> There's a longer explanation available here.
> http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id2277661
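
That restart could look like the following (a sketch only: the host
names besides bigwig and caerbannog are placeholders, and
`stop_app`/`start_app` restart the Rabbit application inside the
running Erlang VM -- a full `rabbitmq-server` restart works as well):

```shell
# Keep one node (here rabbit@bigwig) running as the authoritative copy,
# and restart the rabbit application on the other three so their mnesia
# databases resync from the survivor.
for host in caerbannog host3 host4; do   # host3/host4 are placeholders
  ssh "$host" 'rabbitmqctl stop_app && rabbitmqctl start_app'
done
rabbitmqctl status   # then check on each node that all peers are listed
```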
>> Another 15 minutes later the timedout errors were logged again and
>> that was the end of the cluster; I think two of the nodes managed to
>> reconnect to each other, but the other two remained on their own.
>> The hosts and nodes themselves never shut down, and when I restarted
>> just one of the nodes later in the day, the whole cluster rediscovered
>> itself and all appeared to be well (`rabbitmqctl status` was
>> consistent with expectations).
>> So our first problem is that the nodes did not re-cluster after the
>> second outage.
> If this was caused by the inconsistent_database errors, there's not
> much you can do apart from a restart of some of the nodes.
>> Once we corrected the cluster though, our applications
>> still did not respond and we had to restart all of our clients.
>> Our clients all have a lot of handling for connection drops and
>> channel closures, but most of them did not see any TCP disconnects to
>> their respective nodes.  When the cluster was fixed, we found a lot of
>> our queues missing (they weren't durable), and so we had to restart
>> all of the apps to redeclare the queues.  This still didn't fix our
>> installation though, as our apps were receiving and processing data,
>> but responses were not being sent back out of our HTTP translators.
>> We have a single exchange, "response", that any application expecting a
>> response can bind to.  Our HTTP translators handle traffic from our
>> public endpoints, publish to various exchanges for the services we
>> offer, and those services in turn write back to the response exchange.
>> We have a monitoring tool that confirmed that each translator could
>> write a response to its own Rabbit host and immediately receive it (a
>> ping, more or less).  However, none of the responses from services
>> connected to other Rabbit nodes were received by the translators.
>> In short, it appeared that even though the cluster was healed and all
>> our services had re-declared their queues, the bindings between the
>> response exchange and the queues which our translators use did not
>> appear to propagate to the rest of the nodes in the cluster.
> That doesn't sound right.  As you say, if the cluster was indeed
> running, the queues/exchanges/bindings should have appeared on all of the
> nodes.
> It's possible that the rabbit nodes reconnected successfully, but the
> mnesia ones didn't.  When a rabbitmq node detects another has gone
> down, it automatically removes the queues declared on it from the
> cluster.  If the rabbit nodes think everything is fine, this removal
> wouldn't happen.  As a result, rabbitmqctl might report
> queues/exchanges/bindings that are actually unusable.
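
On the client side, a safe pattern is to treat every (re)connect as a
fresh start and re-declare everything the client depends on; declares
are idempotent in AMQP, so repeating them is cheap.  A minimal sketch,
not tied to any particular client library -- the channel object below
is a stand-in with pika-style method names, and the queue name is a
placeholder:

```python
# Sketch of client-side recovery: on every (re)connect, re-declare the
# exchange, the queue and the binding instead of assuming they survived
# a partition.  "response" is the exchange named earlier in this thread;
# "translator-1" is a hypothetical queue name.

class RecoveringConsumer:
    def __init__(self, channel, exchange="response", queue="translator-1"):
        self.channel = channel    # any object with the three methods below
        self.exchange = exchange
        self.queue = queue

    def redeclare(self):
        # AMQP declares are idempotent: if the entity already exists with
        # the same properties, the declare is a no-op, so this is safe to
        # run on every reconnect.
        self.channel.exchange_declare(exchange=self.exchange, type="direct")
        self.channel.queue_declare(queue=self.queue, durable=False)
        self.channel.queue_bind(queue=self.queue, exchange=self.exchange,
                                routing_key=self.queue)


# A fake channel records the calls a reconnect would replay, so the
# pattern can be exercised without a live broker.
class FakeChannel:
    def __init__(self):
        self.calls = []

    def exchange_declare(self, **kw):
        self.calls.append("exchange_declare")

    def queue_declare(self, **kw):
        self.calls.append("queue_declare")

    def queue_bind(self, **kw):
        self.calls.append("queue_bind")


ch = FakeChannel()
RecoveringConsumer(ch).redeclare()
print(ch.calls)   # exchange first, then queue, then binding
```

Running the redeclare on reconnect would have restored the non-durable
queues and the bindings to the response exchange without an application
restart.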
>> So in summary,
>> * Rabbit didn't re-connect to the other nodes after the second TCP disconnect
> We don't have any logic in the broker to recover from
> inconsistent_database errors.  Your best bet is probably to restart
> all but one of the nodes.
>> * After fixing the cluster (manually or automatically), Rabbit appears
>> to have lost its non-durable queues even though the nodes never
>> stopped
>> * Although we had every indication that exchanges and queues were
>> still alive and functional, bindings appear to have been lost between
>> Rabbit nodes
> See above.  The cluster may not have been completely repaired.  Try
> restarting.
>> What we'd like to know is,
>> * Does any of this make sense and can we add more detail to help fix any bugs?
> It makes some sense.  Thanks for pointing this problem out.
>> * Have there been fixes for these issues since 1.7.2 that we should deploy?
> Not for this, sorry.
>> * Is there anything we should add/change about our applications to
>> deal with these types of situations?
> I'm not sure what you could do to prevent this.  This is more of a
> mnesia problem.
> Cheers,
> Alex

Aaron Westendorf
Senior Software Engineer
Agora Games
359 Broadway
Troy, NY 12180
Phone: 518.268.1000
aaron at agoragames.com
