[rabbitmq-discuss] Diagnosing RabbitMQ Network Partition Problem

Tue Mar 19 09:58:16 GMT 2013

Hi Nick,

On 18 Mar 2013, at 18:02, Nick Slowes wrote:
> I am running a RabbitMQ cluster with two nodes and they continue to periodically experience a network partition.  They are physically located in the same data center and their network should be reliable.  When I check their logs, both servers report the “running_partitioned_network” error at about the same time and both nodes continue running, so I don’t think it is a hardware failure or one of the nodes terminating unexpectedly.

That is almost certainly due to a 'perceived' loss of connectivity - that is to say the Erlang VM thinks there was a partition.

>  I modified the net_ticktime to 120 seconds to try to mitigate the problem, and it stopped occurring for almost a month, but it recently started occurring again once every few days.  Now I am not sure if the net_ticktime helped or if it was just coincidence.
>  
> In order to troubleshoot further, I started a rolling network trace using Wireshark and used a scheduled task to halt the trace when the nodes became partitioned again.  My goal is to determine whether the partition is caused by unreliable network, or if the application failed to respond.  Nothing in the packet trace jumps out as showing a network failure, there are only a handful of TCP retransmissions and plenty of other packets are sent successfully between them. 

Whether or not other packets are sent successfully doesn't determine the outcome of a retry though does it. And the operating system configuration will determine the point at which retransmission failure (as the OS deems it) should result in ETIMEDOUT being returned from socket operations.

>  
> At this point I am not sure what else to look at in the packet trace to either prove or disprove that the network caused the failure.  Wireshark can identify and decode the Erlang Distribution Protocol, but I don’t know how to interpret the messages to know what would cause nodes to detect a partition.  Also, the net_ticktime is set to 120 seconds, and I do not see a 120 second gap in the servers receiving messages from each other.

It's tricky to assist with this diagnosis without seeing the trace output. Are you able to post it, or does it contain private data? If the latter, we (RabbitMQ/VMWare) might be able to sign a confidentiality agreement if that helps. 

>  The longest gap in which no Erlang messages are received from the other server is 22 seconds (much less if you count the TCP acknowledgements).  My only other thought is that if a particular “ping” type message needs to be sent between the nodes and that particular messages was interrupted, but I don’t know what that would look like in the trace.
>  

Also bare in mind that in a RabbitMQ cluster, Erlang nodes may communicate with one another using AMQP and the Erlang distribution protocol and will likely communicate with epmd as well.

> Any ideas on how to diagnose the cause of a network partition would be appreciated.
>  

Without looking at the packet trace, it's hard to suggest anything specific. If you'd like to continue diagnosing this yourself, it might be a good idea to head over to the erlang-questions mailing list and ask about sniffing the distribution protocol using wireshark. If you decide to go down that route, I'll follow you over there and try to assist as best I can.

Cheers,
Tim

> Thanks,
> -Nick Slowes
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss