[rabbitmq-discuss] Diagnosing Network partition false positives

Mon Feb 10 11:54:15 GMT 2014

On 10/02/14 11:36, Patrick Long wrote:
> When I checked our testing environments this morning I saw that one of
> them was reporting a Suspected Network Partition.
>
> Both nodes are virtual machines on the same network so I don't think
> "network partition" is a valid error.

As far as RabbitMQ is concerned, it's a real network partition. The two 
nodes did lose contact with each other.

In a virtualised environment it might be possible to provoke this error 
by (for example) suspending one of the machines.

> Any suggestions on how best to look into this?

RabbitMQ 3.3.0 (currently in the nightly builds) and later releases
will log the reason why one node decided another was down. Unfortunately
that's not available in 3.2.x, so I am afraid RabbitMQ is not going to
help you determine what caused the partition.

It does look suspicious, though, that the partition lasted about 10
seconds and happened at almost exactly 1 AM. That is the sort of time
when scheduled tasks run, so it might be worth looking at what was
happening around then.

> Shouldn't the aliveness test flag up on one of the nodes that there is a
> problem? During this time both reported {200:OK}

The aliveness test only checks that the node itself is alive, not
that the cluster is free of problems. You can check for network
partitions via the HTTP API by requesting /api/nodes and checking
whether any node has a non-empty 'partitions' list.
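
A minimal sketch of that /api/nodes check in Python, using only the
standard library. The function names, the base URL
http://localhost:15672 and the guest/guest credentials are assumptions
for illustration, not something from your setup; it also assumes the
management plugin is enabled and that each entry in the response
carries 'name' and 'partitions' fields.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """GET /api/nodes from the management plugin and return the parsed JSON list."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/nodes")
    # The management HTTP API uses HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def find_partitioned(nodes):
    """Return {node name: partitions list} for every node with a non-empty list."""
    return {
        node.get("name", "<unnamed>"): node["partitions"]
        for node in nodes
        if node.get("partitions")  # a missing or empty list means no partition seen
    }


def report(base_url="http://localhost:15672", user="guest", password="guest"):
    """Print which nodes report a partition (URL and credentials are assumed defaults)."""
    partitioned = find_partitioned(fetch_nodes(base_url, user, password))
    for name, peers in sorted(partitioned.items()):
        print(f"{name} sees a partition with: {', '.join(peers)}")
    return partitioned
```

Calling report() from a scheduled monitoring job and alerting whenever
it returns a non-empty dict would catch a partition that the aliveness
test does not report.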

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal