<div dir="ltr">I upgraded RabbitMQ from 3.0.4 to 3.1.5 and from esl-erlang R15B03 to esl-erlang R16B01. I've been doing some digging and can't find any reason why this would happen: no additional log files, nothing. Note that there are three nodes in the cluster. I upgraded my "dvlp" sample box, which wasn't clustered, with no problem using the exact same script I used to upgrade my alpha cluster. Below is the script I'm using (it might be useful for others upgrading a cluster). I'm going to try recreating my "Alpha" environment and redoing the upgrade. The ONLY thing I can think of is that I was doing stops/starts on the cluster as part of training, about 20 minutes before the upgrade, to test various concepts, e.g. message loss, master node recovery, etc. At this point, if people haven't seen or heard of this, I'll chalk it up to something funky, disk or otherwise, until I can try to replicate it. My biggest concern is that because I no longer have the backups of the mnesia database, recreating the environment won't reproduce exactly what was in the database, so I may not be able to reproduce the failure. When I start deploying to our production environment, I won't make that mistake: I'll shut down rabbit and back up my whole rabbit_data_directory first :)<div style>
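<div>A minimal sketch of that backup step, under stated assumptions: the function name backup_rabbit_data is hypothetical (it is not part of the upgrade script below), and it assumes rabbit has already been stopped on the node so the mnesia files are not changing while tar reads them.</div>

```shell
#!/bin/sh
# Hypothetical helper (not part of the upgrade script): archive a RabbitMQ
# data directory into a timestamped tarball before upgrading.
# Assumes the broker on this node is already stopped.
backup_rabbit_data() {
    src_dir=$1      # directory to back up (the rabbit data / mnesia directory)
    dest_dir=$2     # directory the tarball is written to

    if [ ! -d "$src_dir" ]; then
        echo "backup_rabbit_data: $src_dir is not a directory" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/rabbit_data_$(date +%Y%m%d_%H%M%S).tar.gz"

    # -C makes the paths inside the archive relative to the parent of src_dir
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Print the archive path so the caller can log it
    echo "$archive"
}
```

<div>Usage would be something like: backup_rabbit_data /data/rabbitmq /var/backups/rabbit >> $LOG_FILE, where /data/rabbitmq is the path the script checks for and the destination directory is an arbitrary example.</div>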
Thanks!<br>Jason</div><div><br></div><div><div># ########################################<br></div><div># 04/22/2013 - JasonMcIntosh - core upgrade script to go through each reported node in a cluster and upgrade it</div>
<div># ########################################</div><div>if [ ! -e /data/rabbitmq ]; then</div><div><span class="" style="white-space:pre">        </span>echo "No rabbit found on $1" > /var/log/rabbit_upgrade.log</div>
<div><span class="" style="white-space:pre">        </span>exit 0</div><div>fi</div><div><br></div><div>export LOG_FILE=/var/log/rabbitmq/upgrade.log</div><div>rm -f $LOG_FILE</div><div><br></div><div>export CLUSTER_STATUS="`rabbitmqctl cluster_status`"</div>
<div>export CLUSTER_STATUS="`echo $CLUSTER_STATUS|tr -d ' \n'`"</div><div>echo "Started at `date`" >> $LOG_FILE</div><div>echo "$CLUSTER_STATUS" >> $LOG_FILE</div><div>echo "" >> $LOG_FILE</div>
<div><br></div><div>getServerFQDN() {</div><div><span class="" style="white-space:pre">        </span>SERVER_FQDN=`echo $1 | awk -F@ '{print $2};'`</div><div><span class="" style="white-space:pre">        </span>FQDN=`nslookup $SERVER_FQDN|grep Name|awk -F\: '{print $2}'|sed 's/ //g'`</div>
<div><span class="" style="white-space:pre">        </span>echo $FQDN</div><div>}</div><div><br></div><div>upgradeRabbitNode() {</div><div><span class="" style="white-space:pre">        </span>SERVER_FQDN=`getServerFQDN $1`</div><div><span class="" style="white-space:pre">        </span>echo "Doing upgrade of $SERVER_FQDN" >> $LOG_FILE</div>
<div><span class="" style="white-space:pre">        </span># The upgrade deploy job actually stops the rabbit server, so this shouldn't be needed, but we'll do it anyway</div><div style> # Not the exact commands below, as I'm using BladeLogic internal commands to do these, but the idea should be the same.</div>
<div><span class="" style="white-space:pre">        </span>remote_exec ${SERVER_FQDN} service rabbitmq-server stop >> $LOG_FILE</div><div><span class="" style="white-space:pre">        </span>remote_exec $SERVER_FQDN yum -y erase rabbit* erlang* >> $LOG_FILE</div>
<div> remote_exec $SERVER_FQDN yum -y install rabbitmq-server-3.1.5... >> $LOG_FILE<br></div><div> remote_exec $SERVER_FQDN rabbitmq-plugins enable rabbitmq_management rabbitmq_management_agent rabbitmq_management_visualiser rabbitmq_shovel rabbitmq_shovel_management>> $LOG_FILE<br>
</div><div style> remote_exec $SERVER_FQDN chkconfig rabbitmq-server on</div><div>}</div><div><br></div><div><br></div><div>#Get the cluster nodes and pick the first disk node as the "upgrader" node</div>
<div>export UPGRADER_NODE=`echo "$CLUSTER_STATUS"|awk -F\[ '{print $4}'|awk -F\] '{print $1}'|awk -F, '{print $1}'`</div><div>export UPGRADER_FQDN=`getServerFQDN $UPGRADER_NODE`</div><div>
export NODE_LIST="`echo $CLUSTER_STATUS|awk -F\[ '{sub(/.*running_nodes/,\"\")};1'|awk -F\[ '{print $2}'|awk -F\] '{print $1}'|sed -e 's/,/ /g'`"</div><div><br></div><div>
echo "Node list: $NODE_LIST " >> $LOG_FILE</div><div>echo "Disk node for last upgrade $UPGRADER_NODE">> $LOG_FILE</div><div><br></div><div>if [ "$UPGRADER_NODE" = "" ]; then</div>
<div><span class="" style="white-space:pre">        </span>echo " ** No upgrader node found! EXITING" >> $LOG_FILE</div><div><span class="" style="white-space:pre">        </span>exit 1</div><div>fi</div><div><br></div>
<div><br></div><div>#Shut down and upgrade all nodes other than the upgrader node</div><div>echo "Doing upgrade of all non upgrade nodes..." >> $LOG_FILE</div><div>for clusterNode in ${NODE_LIST}; do</div><div>
<span class="" style="white-space:pre">        </span>if [ "$clusterNode" != "$UPGRADER_NODE" ]; then </div><div><span class="" style="white-space:pre">                </span>upgradeRabbitNode $clusterNode</div><div><span class="" style="white-space:pre">        </span>fi</div>
<div>done</div><div><br></div><div>#Upgrade the upgrader node now.</div><div>echo "Upgrade the core upgrade node ..." >> $LOG_FILE</div><div>upgradeRabbitNode $UPGRADER_NODE</div><div><br></div><div>#NOW start all nodes, starting with the upgrader node.</div>
<div>echo "Starting rabbit on upgrader node..." >> $LOG_FILE</div><div>remote_exec $UPGRADER_FQDN service rabbitmq-server start >> $LOG_FILE</div><div>for clusterNode in ${NODE_LIST}; do</div><div><span class="" style="white-space:pre">        </span>if [ "$clusterNode" != "$UPGRADER_NODE" ]; then </div>
<div><span class="" style="white-space:pre">                </span>SERVER_FQDN=`getServerFQDN $clusterNode`</div><div><span class="" style="white-space:pre">                </span>echo "Starting rabbit on NON upgrade nodes..." >> $LOG_FILE</div>
<div><span class="" style="white-space:pre">                </span>remote_exec $SERVER_FQDN service rabbitmq-server start >> $LOG_FILE</div><div><span class="" style="white-space:pre">        </span>fi</div><div>done</div><div><br></div>
<div>#Finally, make sure our HA Policy is applied to all our virtual hosts</div><div><br></div><div>echo "Finished with upgrade..." >> $LOG_FILE</div><div>#Report how we worked out...</div><div>echo "</div>
<div>RESULTS</div><div>"</div><div>cat $LOG_FILE</div><div><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Aug 27, 2013 at 4:58 AM, Emile Joubert <span dir="ltr">&lt;<a href="mailto:emile@rabbitmq.com" target="_blank">emile@rabbitmq.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="im">On 23/08/13 23:01, Jason McIntosh wrote:<br>
> A ps auf shows /usr/lib/erlang/erts-5.9.3.1/bin/epmd -daemon as still<br>
> running. So I'm wondering if that might have an impact.<br>
<br>
</div>Depending on how you upgraded Erlang you may need to stop this process<br>
manually. I'd be surprised if this was the cause of the error though.<br>
<div class="im"><br>
> stop rabbit on server X (upgrader is Z, other node is Y)<br>
> remove all rabbit/erlang RPMs<br>
> Reinstall rabbit software<br>
> Update rabbitmqadmin<br>
> Enable management plugins (just in case)<br>
> Enable auto start.<br>
><br>
> Rinse and repeat on servers Y, then Z and then start bringing them up<br>
> starting with upgrader node. First start Z, then start Y, then start X.<br>
<br>
</div>From which versions did you upgrade?<br>
<div class="im"><br>
> On Fri, Aug 23, 2013 at 4:51 PM, Jason McIntosh wrote:<br>
<br>
> =INFO REPORT==== 23-Aug-2013::15:37:45 ===<br>
> Disk free limit set to 1000MB<br>
<br>
</div>Were there any other log messages in either logfile or console messages<br>
on any nodes in the interval between or near 15:37:45 - 15:37:50?<br>
<div class="im"><br>
> =ERROR REPORT==== 23-Aug-2013::15:37:50 ===<br>
> ** Generic server <0.303.0> terminating<br>
> ** Last message in was {'EXIT',<0.350.0>,normal}<br>
</div>Did you perform the same upgrade in other environments, and the failure<br>
only occurred in one of the environments?<br>
<span class=""><font color="#888888"><br>
<br>
<br>
-Emile<br>
<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>Jason McIntosh<br><a href="http://mcintosh.poetshome.com/blog/">http://mcintosh.poetshome.com/blog/</a><br>573-424-7612
</div></div></div>
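<div>The node-list extraction in the script above can be tried offline against saved cluster_status text before pointing it at a live cluster. The following is only a sketch: the sample output and hostnames are invented (roughly the shape a 3.1-era rabbitmqctl cluster_status prints), the function names are hypothetical, and running_nodes uses sed in place of the script's awk chain.</div>

```shell
#!/bin/sh
# Sample text only: a hypothetical three-node cluster_status output.
# Hostnames are invented for illustration.
SAMPLE_STATUS='Cluster status of node rabbit@hostz ...
[{nodes,[{disc,[rabbit@hostx,rabbit@hosty,rabbit@hostz]}]},
 {running_nodes,[rabbit@hosty,rabbit@hostx,rabbit@hostz]},
 {partitions,[]}]
...done.'

# Strip spaces and newlines so the field splitting sees a single line.
flatten_status() {
    printf '%s' "$1" | tr -d ' \n'
}

# First node listed under "disc": what the script calls the "upgrader" node.
# Same awk field splitting as the script, with the separators quoted.
first_disc_node() {
    flatten_status "$1" | awk -F'[' '{print $4}' | awk -F']' '{print $1}' | awk -F, '{print $1}'
}

# Space-separated list of running nodes (sed variant of the script's awk chain).
running_nodes() {
    flatten_status "$1" | sed -e 's/.*running_nodes,\[//' -e 's/].*//' -e 's/,/ /g'
}

echo "upgrader: $(first_disc_node "$SAMPLE_STATUS")"
echo "running:  $(running_nodes "$SAMPLE_STATUS")"
```

<div>Both functions depend on the position of the brackets in the output, so if a different RabbitMQ version prints cluster_status in another layout, the field numbers would need to be rechecked.</div>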