[rabbitmq-discuss] Recurring partitioning problem on local network

Wed Dec 11 10:19:08 GMT 2013

On 11/12/13 03:19, Bill Chmura wrote:
> One of our sets went down today
>
> Both nodes basically have this, just naming the other node:
>
> =INFO REPORT==== 10-Dec-2013::18:29:24 ===
> rabbit on node 'rabbit@NURWEB-QAWEB01' down
>
> =ERROR REPORT==== 10-Dec-2013::18:29:35 ===
> Mnesia('rabbit@NURWEB-QAWEB02'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@NURWEB-QAWEB01'}
>
> =INFO REPORT==== 10-Dec-2013::18:29:47 ===
> node 'rabbit@NURWEB-QAWEB01' down: connection_closed
>
> Not much more info with the patched base file... does this help at all?

Somewhat, yes. The interesting bit is the "connection_closed" part. This
means the nodes are not losing each other through the net_ticktime-based
timeout; something is actively closing the TCP connection between the
two hosts. That would explain why the connection comes back again
immediately.
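
If it helps when going through older logs, here is a rough Python sketch
that pulls the reason out of each "node ... down: <reason>" line and
attaches a hint. The function name, the regular expression and the hint
wording are my own assumptions for illustration, not anything RabbitMQ
ships. It also assumes that a tick timeout would show up with the Erlang
reason net_tick_timeout, in the same position where your log shows
connection_closed.

```python
import re

# Matches lines such as: node 'rabbit@HOST' down: connection_closed
# Lines like "rabbit on node '...' down" (no reason) are deliberately skipped.
NODE_DOWN = re.compile(r"node '(?P<node>[^']+)' down: (?P<reason>\w+)")

# Hint text is illustrative only; reasons not listed here fall through
# to a generic message.
HINTS = {
    "connection_closed": "distribution TCP connection was closed; "
                         "check firewalls and other network gear",
    "net_tick_timeout": "no traffic seen within net_ticktime; "
                        "peer or network was unresponsive",
}


def classify_node_down(log_text):
    """Return a list of (node, reason, hint) tuples, one per node-down line."""
    results = []
    for match in NODE_DOWN.finditer(log_text):
        reason = match.group("reason")
        hint = HINTS.get(reason, "other reason; see the Erlang net_kernel docs")
        results.append((match.group("node"), reason, hint))
    return results


if __name__ == "__main__":
    sample = (
        "=INFO REPORT==== 10-Dec-2013::18:29:47 ===\n"
        "node 'rabbit@NURWEB-QAWEB01' down: connection_closed\n"
    )
    for node, reason, hint in classify_node_down(sample):
        print(f"{node}: {reason} -> {hint}")
```

Running this over the full log of each node should show whether every
partition carries connection_closed or whether some are tick timeouts,
which would point at a different cause.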

Do you have some sort of firewall or other network infrastructure that
could be forcibly closing this connection?

> I tried searching and got a lot on connection closed abruptly... but it did not sound right.

No, that's a different thing: we log "connection closed abruptly" when 
AMQP connections go away without going through the AMQP close handshake.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal