<div dir="ltr"><div><div>Hmmm... So, about 1/4 of the queues were close to being balanced when all of the hosts dropped out of the test cluster.<br><br></div>Looking at the system, only the second instance is listening on any network ports at all, but the beam.smp processes for all instances are still running:<br>

<br># netstat -tlpn | grep beam<br>tcp        0      0 0.0.0.0:44442               0.0.0.0:*                   LISTEN      20303/beam.smp      <br>tcp        0      0 0.0.0.0:16639               0.0.0.0:*                   LISTEN      20303/beam.smp      <br>

tcp        0      0 :::5672                     :::*                        LISTEN      20303/beam.smp      <br><br></div># ps axf<br><div>12449 ?        S      0:08 /usr/lib64/erlang/erts-5.8.5/bin/epmd -daemon<br>20183 ?        Sl    13:50 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p<br>

20303 ?        Sl    21:31 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p<br>20739 ?        Ss     0:04  \_ inet_gethost 4<br>

20740 ?        S      0:04      \_ inet_gethost 4<br>20423 ?        Sl    11:24 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p<br>

20543 ?        Sl    11:41 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p<br>20663 ?        Sl    10:18 /usr/lib64/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -p<br>

<br><div><div></div></div></div><div class="gmail_extra">I killed them all and started them back up again. They spent a long time rescanning all their queues, and when the main node (the one the queues were created on) re-entered the cluster, they all crashed again. I then blew away the test cluster and re-ran the following steps. This time the main node, where all the queues were created and the data was loaded, stopped listening on all its ports and disconnected from the rest of the cluster. Killing it and restarting it caused it to reread all its queues, and then drop all its network listeners and connections to the cluster again.<br>

<br>./create_cluster.sh &amp;&amp; ./setup_queues.sh &amp;&amp; ./populate_queues.sh<br></div><div class="gmail_extra">./rebalance_cluster.sh<br><br></div><div class="gmail_extra">The logs for instances that drop their network connections show some pretty big Erlang stack dumps, so I&#39;ve compressed and uploaded a sample log file here for you: <a href="https://mega.co.nz/#!31JCGSxa!LdxuhXeX_HN_px8lnQcp63RkFcRY_dW9Z_x9C7qv2aE">https://mega.co.nz/#!31JCGSxa!LdxuhXeX_HN_px8lnQcp63RkFcRY_dW9Z_x9C7qv2aE</a><br>

</div><div class="gmail_extra"><br></div><div class="gmail_extra">Here&#39;s my latest set of scripts for reproduction at your end: <a href="https://mega.co.nz/#!Th5UCBZS!eQe9_SmOS5qdv0tP8nUZmnjj7h1QhtOG1i4GrtoYeMU">https://mega.co.nz/#!Th5UCBZS!eQe9_SmOS5qdv0tP8nUZmnjj7h1QhtOG1i4GrtoYeMU</a><br>

</div><div class="gmail_extra"><br></div>Graeme<br><div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Nov 26, 2013 at 1:28 PM, Graeme N <span dir="ltr">&lt;<a href="mailto:graeme@sudo.ca" target="_blank">graeme@sudo.ca</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">On Tue, Nov 26, 2013 at 1:13 PM, Graeme N <span dir="ltr">&lt;<a href="mailto:graeme@sudo.ca" target="_blank">graeme@sudo.ca</a>&gt;</span> wrote:<br>


<div class="im"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra">

<div><div class="gmail_quote">On Mon, Nov 25, 2013 at 9:39 AM, Simon MacMullen <span dir="ltr">&lt;<a href="mailto:simon@rabbitmq.com" target="_blank">simon@rabbitmq.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Graeme: we recently merged a few fixes to bugs that we found as a result of running your test scripts; these can be picked up in the latest nightlies. I am now able to run rebalance_cluster.sh and toggle_nodes.sh indefinitely without anything breaking.<br>


</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Thanks for producing these scripts!<br></blockquote></div></div></div></div></blockquote></div></div></div></div>

</blockquote></div><br></div></div></div>