[rabbitmq-discuss] Diagnosing RabbitMQ Network Partition Problem

Nick Slowes nslowes at overdrive.com
Mon Mar 18 18:02:43 GMT 2013


Hi,

I am running a RabbitMQ cluster with two nodes and they continue to
periodically experience a network partition.  They are physically
located in the same data center and their network should be reliable.
When I check their logs, both servers report the
"running_partitioned_network" error at about the same time and both
nodes continue running, so I don't think it is a hardware failure or one
of the nodes terminating unexpectedly.  I modified the net_ticktime to
120 seconds to try to mitigate the problem, and it stopped occurring for
almost a month, but it recently started occurring again once every few
days.  Now I am not sure if the net_ticktime helped or if it was just
coincidence.
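
As a rough sketch of what that setting implies, assuming the behaviour
described in the Erlang kernel documentation for net_ticktime (a tick is
sent every net_ticktime / 4 seconds on an otherwise idle connection, and
an unresponsive node is detected somewhere between 0.75 and 1.25 times
net_ticktime), the timings can be worked out as below.  The function
name tick_window is made up for illustration.

```python
def tick_window(net_ticktime):
    """Return (tick_interval, min_detect, max_detect) in seconds.

    Assumption: ticks are sent every net_ticktime / 4 seconds and a
    silent peer is declared down between 0.75x and 1.25x net_ticktime.
    """
    tick_interval = net_ticktime / 4
    min_detect = net_ticktime - tick_interval
    max_detect = net_ticktime + tick_interval
    return tick_interval, min_detect, max_detect


print(tick_window(120))  # (30.0, 90.0, 150.0)
```

Under that assumption, a setting of 120 seconds means a silence of as
little as 90 seconds could already be enough to trigger detection.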

In order to troubleshoot further, I started a rolling network trace
using Wireshark and used a scheduled task to halt the trace when the
nodes became partitioned again.  My goal is to determine whether the
partition is caused by unreliable network, or if the application failed
to respond.  Nothing in the packet trace jumps out as showing a network
failure: there are only a handful of TCP retransmissions, and plenty of
other packets are sent successfully between them.

At this point I am not sure what else to look at in the packet trace to
either prove or disprove that the network caused the failure.  Wireshark
can identify and decode the Erlang Distribution Protocol, but I don't
know how to interpret the messages to know what would cause nodes to
detect a partition.  Also, the net_ticktime is set to 120 seconds, and I
do not see a 120 second gap in the servers receiving messages from each
other.  The longest gap in which no Erlang messages are received from
the other server is 22 seconds (much less if you count the TCP
acknowledgements).  My only other thought is that a particular "ping"
type message needs to be sent between the nodes and that particular
message was interrupted, but I don't know what that would look like in
the trace.
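
The gap check described above could be sketched roughly as follows,
assuming the arrival times of packets received from the other node have
been exported from Wireshark as epoch seconds (for example via the
frame.time_epoch field).  The function name longest_gap and the sample
timestamps are hypothetical, and the 0.75 factor is an assumption taken
from the Erlang kernel documentation for net_ticktime.

```python
NET_TICKTIME = 120  # seconds, as configured on both nodes


def longest_gap(timestamps):
    """Return (gap_seconds, start, end) for the largest gap between
    consecutive timestamps, or None if there are fewer than two."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    return max(
        ((later - earlier, earlier, later)
         for earlier, later in zip(ordered, ordered[1:])),
        key=lambda gap: gap[0],
    )


# Hypothetical arrival times (epoch seconds) of packets from the peer.
received = [0.0, 5.0, 10.0, 32.0, 35.0]

gap = longest_gap(received)
print(gap)  # (22.0, 10.0, 32.0)

# Assumption: a silence shorter than 0.75 * net_ticktime should not,
# on its own, cause the peer to be considered down.
print(gap[0] >= 0.75 * NET_TICKTIME)  # False
```

Running this separately on the traffic received by each node would show
whether either side ever saw a silence long enough to matter.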

Any ideas on how to diagnose the cause of a network partition would be
appreciated.

Thanks,

-Nick Slowes
