<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">  I've tried to upgrade my cluster to 3.3.0.<br>


</blockquote>

<br></div>

<snip><div class=""><br></div></blockquote><div>As I've wrote before, I did a clean install, to drop the inherited possibly damaged mnesia files, so the previous installation can not be tested. But of course I've got all the logs.</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="">

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br></blockquote></div>

This is quite weird. The diagnostics indicate that there actually is some process up, registered with epmd and listening on the Erlang distribution port for both node2 and node3.</blockquote><div>I've seen this quite a few times.</div>


<div><br></div><div>Just starting and stopping rabbitmq-server leaves one running epmd process in memory! I've noticed it before, but it looks like it's intended.</div><div><br></div><div>For example (after the fresh rabbit install):</div>


<div><br></div><div>petros@node3:~$ ps aux|grep rabbitmq</div><div>rabbitmq  1223  0.0  0.0   7496   320 ?        S    00:41   0:00 /usr/lib/erlang/erts-5.10.2/bin/epmd -daemon</div><div>rabbitmq  2228  0.0  0.0   4440   624 ?        S    00:42   0:00 /bin/sh /usr/sbin/rabbitmq-server</div>


<div>rabbitmq  2268  2.1  0.6 1002616 51564 ?       Sl   00:42   0:13 /usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@node3 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit@node3.log"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit@node3-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@node3-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node3" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672</div>


<div>rabbitmq  2368  0.0  0.0   7460   428 ?        Ss   00:42   0:00 inet_gethost 4</div><div>rabbitmq  2369  0.0  0.0  13788   832 ?        S    00:42   0:00 inet_gethost 4</div><div>petros    5381  0.0  0.0  11196   948 pts/0    S+   00:52   0:00 grep --color=auto rabbitmq</div>


<div>petros@node3:~$ sudo service rabbitmq-server stop</div><div> * Stopping message broker rabbitmq-server                                                                                                                                                    [ OK ] </div>


<div>petros@node3:~$ ps aux|grep rabbitmq</div><div>rabbitmq  1223  0.0  0.0   7496   320 ?        S    00:41   0:00 /usr/lib/erlang/erts-5.10.2/bin/epmd -daemon</div><div>petros    5505  0.0  0.0  11196   944 pts/0    S+   00:53   0:00 grep --color=auto rabbitmq</div>


<div>petros@node3:~$ </div><div><br></div><div>So after stopping, one erlang process still runs, who knows why. :)</div><div><br></div><div><div>petros@node3:~$ sudo lsof |grep epmd</div><div>epmd      1223        rabbitmq  cwd       DIR                8,1     4096          2 /</div>


<div>epmd      1223        rabbitmq  rtd       DIR                8,1     4096          2 /</div><div>epmd      1223        rabbitmq  txt       REG                8,1    51856     936647 /usr/lib/erlang/erts-5.10.2/bin/epmd</div>


<div>epmd      1223        rabbitmq  mem       REG                8,1  1853400     394057 /lib/x86_64-linux-gnu/<a href="http://libc-2.17.so">libc-2.17.so</a></div><div>epmd      1223        rabbitmq  mem       REG                8,1  1063328     394030 /lib/x86_64-linux-gnu/<a href="http://libm-2.17.so">libm-2.17.so</a></div>


<div>epmd      1223        rabbitmq  mem       REG                8,1   149312     394033 /lib/x86_64-linux-gnu/<a href="http://ld-2.17.so">ld-2.17.so</a></div><div>epmd      1223        rabbitmq    0r      CHR                1,3      0t0       1029 /dev/null</div>


<div>epmd      1223        rabbitmq    1w      CHR                1,3      0t0       1029 /dev/null</div><div>epmd      1223        rabbitmq    2w      CHR                1,3      0t0       1029 /dev/null</div><div>epmd      1223        rabbitmq    3u     IPv4              10155      0t0        TCP *:epmd (LISTEN)</div>


</div><div><br></div><div>and it's listening too...</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div class="">

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

WHY is it waiting for node2, node3, if node1 was the last to stop, and<br>

it should come online in itself without them?<br>

</blockquote>

<br></div>

Very good question.<br>

<br>

I was able to reproduce this precise error message by doing the equivalent of:<br>

<br>

(stop node1)<br>

rabbitmqctl -n node2 stop_app<br>

rabbitmqctl -n node3 stop_app<br>

(start node1)<br>

<br>

i./e. leaving the Erlang VM running but stopping RabbitMQ (and Mnesia) on node2 and node3.<br>

<br>

Is there any possibility at all you could have done something like this? Is there a beam.smp process running on those nodes?<br></blockquote><div>I don't run rabbitmqctl stop_app commands by hand, I've got these things (creating a cluster, changing nodetype) scripted, so no, that's not the case. Even if I run something that needs stop_app, there was a start_app in a second.</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

I am especially puzzled since the nodes have registered 25672 as a distribution port - prior to 3.3.0 the distribution port would be chosen at random, or you could configure it. So the presence of 25672 strongly indicates that those nodes have been upgraded to 3.3.0 and are running an Erlang VM - with RabbitMQ stopped?</blockquote>


<div>I think that the standard apt-get process is doing something like this:</div><div>service rabbitmq-server stop</div><div>upgrade</div><div>service rabbitmq-server start</div><div><br></div><div>but as we've already identified rabbitmq-server stop does leave an epmd running, as it did with 3.1.3, 3.2.2, 3.2.3, 3.2.4, 3.3.0 </div>


<div><br></div><div>so after upgradeing the epmd isn't fresh, it may have previous settings, apps running.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div class="">

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The cookie is the same, the hostnames are correct.<br>

</blockquote>

<br></div>

If I am correct about the diagnosis above then the diagnostics don't currently handle this situation well; they don't check for the case where Erlang is running but RabbitMQ isn't. This will get fixed in the next release.</blockquote>


<div>Intensive logging would help too. </div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div class=""><br class=""><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">  I'm trying to start node2 and node3 to have at least one running but<br>


they always fail, even if I copied back the previous disk state, and<br>upgraded again...<br></blockquote><br></div>How do they fail? Because it sounds like something is running.</blockquote><div> Node2 and node3 always exited with the message that they are not the one stopped the last, and if I removed the locks they tried to connect to each other - while repairing the mnesia db, and as the network connect failed (as they were not accepting connections probably because the db repair was on), they quit, leaving the repair in half. So the nodes were not able to communicate because of the db repair (this is a theory only), and as the timeout arrived they quit before the repair would finish. (It would be nice to have a message like "starting repair" and "repair finished", as the log message currently is not clear about the state of the repair, I'm not sure that if the repair ends I'll get a logitem.)</div>


<div><br></div></div><div><br></div><div>==== cut here ====</div><div><br></div><div>At this very moment I have the following situation (after the clean 3.3.0 install on all nodes, and cluster rebuilt):</div><div>node1,node2,node3 is after an apt-get upgrade, and needs restarting</div>


<div><br></div><div>node1 restart done, fine.</div><div>node3 restart done, fine.</div><div><br></div><div>node2: sudo reboot, and it's just waiting forever (without the usual conneciton closed by foreign host)</div>

<div>

<br></div><div>root@node2:~# rabbitmqctl status</div><div>Status of node rabbit@node2 ...</div><div>Error: unable to connect to node rabbit@node2: nodedown</div><div><br></div><div>DIAGNOSTICS</div><div>===========</div>

<div>

<br></div><div>attempted to contact: [rabbit@node2]</div><div><br></div><div>rabbit@node2:</div><div>  * rabbit seems not to be running at all</div><div>  * other nodes on node2: [rabbitmqctl13579]</div><div><br></div><div>


current node details:</div><div>- node name: rabbitmqctl14147@node2</div><div>- home dir: /var/lib/rabbitmq</div><div>- cookie hash: XSquoZCiCPIfZlTilwy1Dg==</div><div><br></div><div>....</div><div><br></div><div>root@node2:~# ps aux|grep rabbit</div>


<div>rabbitmq  1221  0.0  0.0   9416  2372 ?        S    Apr02   0:10 /usr/lib/erlang/erts-5.10.2/bin/epmd -daemon</div><div>rabbitmq  2189  0.0  0.0   4440   624 ?        S    Apr02   0:00 /bin/sh /usr/sbin/rabbitmq-server</div>


<div>rabbitmq  2236  2.3  0.6 995960 53324 ?        Sl   Apr02 211:08 /usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@node2 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit@node2.log"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit@node2-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@node2-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node2" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672</div>


<div>root     13513  0.0  0.0   4440   776 ?        S    00:42   0:00 /bin/sh /etc/rc6.d/K20rabbitmq-server stop</div><div>root     13568  0.0  0.0   4440   680 ?        S    00:42   0:00 /bin/sh /usr/sbin/rabbitmqctl stop /var/run/rabbitmq/pid</div>


<div>root     13577  0.0  0.0  53204  1652 ?        S    00:42   0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "stop" "/var/run/rabbitmq/pid"</div><div>rabbitmq 13578  0.0  0.0   4440   624 ?        Ss   00:42   0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "stop" "/var/run/rabbitmq/pid"</div>


<div>rabbitmq 13579  0.1  0.1 437204 14676 ?        Sl   00:42   0:01 /usr/lib/erlang/erts-5.10.2/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl13579 -boot start_clean -s rabbit_control_main -nodename rabbit@node2 -extra stop /var/run/rabbitmq/pid</div>


<div>rabbitmq 13611  0.0  0.0   7460   432 ?        Ss   00:42   0:00 inet_gethost 4</div><div>rabbitmq 13612  0.0  0.0   9560   636 ?        S    00:42   0:00 inet_gethost 4</div><div>root     14958  0.0  0.0  11196   948 pts/0    S+   01:00   0:00 grep --color=auto rabbit</div>


<div><br></div><div><br></div><div>NOTE that, we have the "K20rabbitmq-server stop"  running, we have the rabbitmq processes too, but currently neither rabbitmqctl nor management plugin is accessible, and the reboot process hung.<br>


</div><div><br></div><div>This was the very reason about month ago, we first killed the rabbitmq processes. After that we thought "maybe we corrupted mnesia with the kill, and we don't know why it hang for the first time".</div>


<div><br></div><div>Unfortunately as the computer is in a "going down to reboot" state, my only shell is the logged in one (one can't get a new connection while the system is going down).... but by starting sshd manually I can have this node powered up for a while.</div>


<div><br></div><div>Log files:</div><div><br></div><div>the sasl log on node2 is empty (yeeah)</div><div>shutdown_err and startup_err is empty (whoa)</div><div>startup_log is normal</div><div>shutdown_log </div><div>"Stopping and halting node rabbit@node2 ...</div>


<div>"</div><div>and it has the timestamp of the reboot command</div><div><br></div><div>node2.log:</div><div><br></div><div>"=INFO REPORT==== 9-Apr-2014::00:42:31 ===</div><div>Stopping RabbitMQ</div><div><br>

</div>

<div>=INFO REPORT==== 9-Apr-2014::00:42:31 ===</div><div>stopped TCP Listener on [::]:5672</div><div><br></div><div>....a few closed connections....</div><div><br></div><div><div>=INFO REPORT==== 9-Apr-2014::00:42:32 ===</div>


<div>Halting Erlang VM</div></div><div>"</div><div><br></div><div> so it looks like, that rabbitmq really exited 25 minutes ago, but the stop script won't exit, as the processes still running.</div><div><br></div>


<div><br></div><div> Could the problem be the inet_gethost process? Unfortunately I've seen dns problems in our networks, so that could be a problem, but that should not hang the shutdown.</div><div><br></div><div><br>


</div><div> The running rabbit processes and their open files are here:</div><div><br></div><div><div> <a href="http://pastebin.com/mLKf5mEu">http://pastebin.com/mLKf5mEu</a></div></div><div><br></div><div><br></div><div>


<div>root@node2:~# pstree</div><div>init─┬─dbus-daemon</div><div>...</div><div>     ├─epmd</div><div>     ├─rabbitmq-server───beam.smp───38*[{beam.smp}]<br></div><div>     ├─rc───K20rabbitmq-ser───rabbitmqctl───su───sh───beam.smp─┬─inet_gethost───inet_gethost</div>


<div>     │                                                         └─18*[{beam.smp}]</div><div>...</div></div><div><br></div><div>So, I'm looking to find the hangup process... straceing and killing them.</div><div>The first try is the gethosts:</div>


<div><br></div><div><div>root@node2:~# strace -p 13612</div><div>Process 13612 attached</div><div>read(0, </div><div><br></div><div>....</div><div><br></div><div>^CProcess 13612 detached</div><div> <detached ...></div>


<div>root@node2:~# strace -p 13611</div><div>Process 13611 attached</div><div>select(4, [0 3], NULL, NULL, NULL</div><div>....</div><div>^CProcess 13611 detached</div><div><br></div><div> <detached ...></div><div><div>


root@node2:~# kill 13612</div></div></div><div><div>root@node2:~# kill 13611</div></div><div><br></div><div>weee, after this second kill we've got the inet_gethost respawned, so the beam.smp (pid 13579) still works (but it should have already quit :) )</div>


<div><br></div><div>We know that some from these are waiting:</div><div><br></div><div><div>        ├─rc(13415)───K20rabbitmq-ser(13513)───rabbitmqctl(13568)───su(13577)───sh(13578)───beam.smp(13579)─┬─inet_gethost(16346)</div>


<div>        │                                                                                                   ├─{beam.smp}(13591)</div><div>        │                                                                                                   ├─{beam.smp}(13592)</div>


<div>        │                                                                                                   ├─{beam.smp}(13593)</div><div>...</div><div>        │                                                                                                   └─{beam.smp}(13608)<br>


</div></div><div><br></div><div><br></div><div> Any ideas? How can I debug this? </div><div> Specific questions? </div><div><br></div><div> I will leave this system in this state for now, to allow us to diagnose the situation. Unfortunately I don't know erlang, but I can try anything if I can help solve these hangups. As I've said this is the same behaviour as 3.1.3, 3.2.2, 3.2.3 and so on worked, they sometimes just hung, then we killed them and had that bad feeling in our hearts, that we may broke the db. I won't kill this time. :)</div>


<div><br></div><div> If you can "remote control" me, I may even try to erlang debug the processes, to guess what they're waiting for.</div><div><br></div><div> Thank you for your help, I hope these hangups could get eliminated. <br>


</div><div><br></div><div> Have a nice day,</div><div><br></div><div> Peter</div><div><br></div><div><br></div><div><br></div></div></div></div>