[rabbitmq-discuss] Diagnosing Network partition false positives
Simon MacMullen
simon at rabbitmq.com
Mon Feb 10 11:54:15 GMT 2014
On 10/02/14 11:36, Patrick Long wrote:
> When I checked our testing environments this morning I saw that one of
> them was reporting a Suspected Network Partition.
>
> Both nodes are virtual machines on the same network so I don't think
> "network partition" is a valid error.
As far as RabbitMQ is concerned, it's a real network partition. The two
nodes did lose contact with each other.
In a virtualised environment it might be possible to provoke this error
by (for example) suspending one of the machines.
> Any suggestions on how best to look into this?
Future releases of RabbitMQ (3.3.0, currently in the nightly builds)
will log the reason why one node decided another was down. Unfortunately
that's not available in 3.2.x, so I am afraid RabbitMQ is not going to
help you determine what caused the partition.
I am suspicious, though, that the partition lasted about 10 seconds and
happened at almost exactly 1AM. 1AM is the sort of time when scheduled
tasks run, so it might be worth looking at what was happening around then.
> Shouldn't the aliveness test flag up on one of the nodes that there is a
> problem? During this time both reported {200:OK}
The aliveness test is just there to check that the node is alive, not
that there are no problems with the cluster. You can check for network
partitions via the HTTP API by looking at /api/nodes and checking if any
node has a non-empty 'partitions' list.
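
As a rough illustration of that check, here is a minimal Python sketch. It
assumes the management plugin is enabled and reachable over HTTP with basic
auth; the function names fetch_nodes and partitioned_nodes are made up for
this example and are not part of RabbitMQ. The only things taken from the
API are the /api/nodes path and the per-node 'name' and 'partitions' fields.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """GET <base_url>/api/nodes and return the parsed JSON list.

    base_url, user and password are whatever your management plugin
    is configured with (assumption: HTTP basic auth is in use).
    """
    request = urllib.request.Request(base_url.rstrip("/") + "/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def partitioned_nodes(nodes):
    """Map node name -> 'partitions' list, keeping only non-empty lists.

    A missing 'partitions' key is treated the same as an empty list.
    """
    return {
        node.get("name"): node["partitions"]
        for node in nodes
        if node.get("partitions")
    }


# Example use against a live broker (URL and credentials are placeholders):
#   nodes = fetch_nodes("http://localhost:15672", "guest", "guest")
#   for name, seen in partitioned_nodes(nodes).items():
#       print(name, "reports a partition with", seen)
```

An empty result means no node currently reports a partition. A monitoring
check could alert whenever the result is non-empty, alongside the aliveness
test rather than instead of it.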
Cheers, Simon
--
Simon MacMullen
RabbitMQ, Pivotal