<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
        {mso-style-priority:99;
        mso-style-link:"Balloon Text Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:8.0pt;
        font-family:"Tahoma","sans-serif";}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
span.BalloonTextChar
        {mso-style-name:"Balloon Text Char";
        mso-style-priority:99;
        mso-style-link:"Balloon Text";
        font-family:"Tahoma","sans-serif";}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Hi,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I am running a two-node RabbitMQ cluster, and the nodes periodically experience a network partition. They are physically located in the same data center and their network should be reliable. When I check their logs, both servers report the “running_partitioned_network” error at about the same time and both nodes continue running, so I don’t think the cause is a hardware failure or one of the nodes terminating unexpectedly. I changed net_ticktime to 120 seconds to try to mitigate the problem, and the partitions stopped for almost a month, but they recently started occurring again once every few days. Now I am not sure whether the net_ticktime change helped or whether it was just a coincidence.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>In order to troubleshoot further, I started a rolling network trace using Wireshark and used a scheduled task to halt the trace when the nodes became partitioned again. My goal is to determine whether the partition is caused by an unreliable network or by the application failing to respond. Nothing in the packet trace jumps out as showing a network failure: there are only a handful of TCP retransmissions, and plenty of other packets are sent successfully between the nodes. <o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>At this point I am not sure what else to look at in the packet trace to either prove or disprove that the network caused the failure. Wireshark can identify and decode the Erlang Distribution Protocol, but I don’t know how to interpret the messages to know what would cause the nodes to detect a partition. Also, net_ticktime is set to 120 seconds, and I do not see a 120-second gap in the servers receiving messages from each other.
The longest gap in which no Erlang messages are received from the other server is 22 seconds (much less if you count the TCP acknowledgements). My only other thought is that a particular “ping”-type message needs to be sent between the nodes and that this message was interrupted, but I don’t know what that would look like in the trace.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Any ideas on how to diagnose the cause of a network partition would be appreciated.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Thanks,<o:p></o:p></p><p class=MsoNormal>-Nick Slowes<o:p></o:p></p></div></body></html>