[rabbitmq-discuss] Upgrade fail

Wed Apr 9 00:48:51 BST 2014

>
>   I've tried to upgrade my cluster to 3.3.0.
>>
>
> <snip>
>
> As I've wrote before, I did a clean install, to drop the inherited
possibly damaged mnesia files, so the previous installation can not be
tested. But of course I've got all the logs.

>> This is quite weird. The diagnostics indicate that there actually is some
> process up, registered with epmd and listening on the Erlang distribution
> port for both node2 and node3.

I've seen this quite a few times.

Just starting and stopping rabbitmq-server leaves one running epmd process
in memory! I've noticed it before, but it looks like it's intended.

For example (after the fresh rabbit install):

petros at node3:~$ ps aux|grep rabbitmq
rabbitmq  1223  0.0  0.0   7496   320 ?        S    00:41   0:00
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
rabbitmq  2228  0.0  0.0   4440   624 ?        S    00:42   0:00 /bin/sh
/usr/sbin/rabbitmq-server
rabbitmq  2268  2.1  0.6 1002616 51564 ?       Sl   00:42   0:13
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 --
-root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-s rabbit boot -sname rabbit at node3 -boot start_sasl -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit at node3.log"} -rabbit
sasl_error_logger {file,"/var/log/rabbitmq/rabbit at node3-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit
plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit at node3-plugins-expand"
-os_mon start_cpu_sup false -os_mon start_disksup false -os_mon
start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit at node3"
-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
rabbitmq  2368  0.0  0.0   7460   428 ?        Ss   00:42   0:00
inet_gethost 4
rabbitmq  2369  0.0  0.0  13788   832 ?        S    00:42   0:00
inet_gethost 4
petros    5381  0.0  0.0  11196   948 pts/0    S+   00:52   0:00 grep
--color=auto rabbitmq
petros at node3:~$ sudo service rabbitmq-server stop
 * Stopping message broker rabbitmq-server

                                       [ OK ]
petros at node3:~$ ps aux|grep rabbitmq
rabbitmq  1223  0.0  0.0   7496   320 ?        S    00:41   0:00
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
petros    5505  0.0  0.0  11196   944 pts/0    S+   00:53   0:00 grep
--color=auto rabbitmq
petros at node3:~$

So after stopping, one erlang process still runs, who knows why. :)

petros at node3:~$ sudo lsof |grep epmd
epmd      1223        rabbitmq  cwd       DIR                8,1     4096
       2 /
epmd      1223        rabbitmq  rtd       DIR                8,1     4096
       2 /
epmd      1223        rabbitmq  txt       REG                8,1    51856
  936647 /usr/lib/erlang/erts-5.10.2/bin/epmd
epmd      1223        rabbitmq  mem       REG                8,1  1853400
  394057 /lib/x86_64-linux-gnu/libc-2.17.so
epmd      1223        rabbitmq  mem       REG                8,1  1063328
  394030 /lib/x86_64-linux-gnu/libm-2.17.so
epmd      1223        rabbitmq  mem       REG                8,1   149312
  394033 /lib/x86_64-linux-gnu/ld-2.17.so
epmd      1223        rabbitmq    0r      CHR                1,3      0t0
    1029 /dev/null
epmd      1223        rabbitmq    1w      CHR                1,3      0t0
    1029 /dev/null
epmd      1223        rabbitmq    2w      CHR                1,3      0t0
    1029 /dev/null
epmd      1223        rabbitmq    3u     IPv4              10155      0t0
     TCP *:epmd (LISTEN)

and it's listening too...

>
>  WHY is it waiting for node2, node3, if node1 was the last to stop, and
>> it should come online in itself without them?
>>
>
> Very good question.
>
> I was able to reproduce this precise error message by doing the equivalent
> of:
>
> (stop node1)
> rabbitmqctl -n node2 stop_app
> rabbitmqctl -n node3 stop_app
> (start node1)
>
> i./e. leaving the Erlang VM running but stopping RabbitMQ (and Mnesia) on
> node2 and node3.
>
> Is there any possibility at all you could have done something like this?
> Is there a beam.smp process running on those nodes?
>
I don't run rabbitmqctl stop_app commands by hand, I've got these things
(creating a cluster, changing nodetype) scripted, so no, that's not the
case. Even if I run something that needs stop_app, there was a start_app in
a second.

I am especially puzzled since the nodes have registered 25672 as a
> distribution port - prior to 3.3.0 the distribution port would be chosen at
> random, or you could configure it. So the presence of 25672 strongly
> indicates that those nodes have been upgraded to 3.3.0 and are running an
> Erlang VM - with RabbitMQ stopped?

I think that the standard apt-get process is doing something like this:
service rabbitmq-server stop
upgrade
service rabbitmq-server start

but as we've already identified rabbitmq-server stop does leave an epmd
running, as it did with 3.1.3, 3.2.2, 3.2.3, 3.2.4, 3.3.0

so after upgradeing the epmd isn't fresh, it may have previous settings,
apps running.

>
>  The cookie is the same, the hostnames are correct.
>>
>
> If I am correct about the diagnosis above then the diagnostics don't
> currently handle this situation well; they don't check for the case where
> Erlang is running but RabbitMQ isn't. This will get fixed in the next
> release.

Intensive logging would help too.

>
>   I'm trying to start node2 and node3 to have at least one running but
>> they always fail, even if I copied back the previous disk state, and
>> upgraded again...
>>
>
> How do they fail? Because it sounds like something is running.

 Node2 and node3 always exited with the message that they are not the one
stopped the last, and if I removed the locks they tried to connect to each
other - while repairing the mnesia db, and as the network connect failed
(as they were not accepting connections probably because the db repair was
on), they quit, leaving the repair in half. So the nodes were not able to
communicate because of the db repair (this is a theory only), and as the
timeout arrived they quit before the repair would finish. (It would be nice
to have a message like "starting repair" and "repair finished", as the log
message currently is not clear about the state of the repair, I'm not sure
that if the repair ends I'll get a logitem.)

==== cut here ====

At this very moment I have the following situation (after the clean 3.3.0
install on all nodes, and cluster rebuilt):
node1,node2,node3 is after an apt-get upgrade, and needs restarting

node1 restart done, fine.
node3 restart done, fine.

node2: sudo reboot, and it's just waiting forever (without the usual
conneciton closed by foreign host)

root at node2:~# rabbitmqctl status
Status of node rabbit at node2 ...
Error: unable to connect to node rabbit at node2: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit at node2]

rabbit at node2:
  * rabbit seems not to be running at all
  * other nodes on node2: [rabbitmqctl13579]

current node details:
- node name: rabbitmqctl14147 at node2
- home dir: /var/lib/rabbitmq
- cookie hash: XSquoZCiCPIfZlTilwy1Dg==

....

root at node2:~# ps aux|grep rabbit
rabbitmq  1221  0.0  0.0   9416  2372 ?        S    Apr02   0:10
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
rabbitmq  2189  0.0  0.0   4440   624 ?        S    Apr02   0:00 /bin/sh
/usr/sbin/rabbitmq-server
rabbitmq  2236  2.3  0.6 995960 53324 ?        Sl   Apr02 211:08
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 --
-root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-s rabbit boot -sname rabbit at node2 -boot start_sasl -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit at node2.log"} -rabbit
sasl_error_logger {file,"/var/log/rabbitmq/rabbit at node2-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit
plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit at node2-plugins-expand"
-os_mon start_cpu_sup false -os_mon start_disksup false -os_mon
start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit at node2"
-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
root     13513  0.0  0.0   4440   776 ?        S    00:42   0:00 /bin/sh
/etc/rc6.d/K20rabbitmq-server stop
root     13568  0.0  0.0   4440   680 ?        S    00:42   0:00 /bin/sh
/usr/sbin/rabbitmqctl stop /var/run/rabbitmq/pid
root     13577  0.0  0.0  53204  1652 ?        S    00:42   0:00 su
rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "stop"
"/var/run/rabbitmq/pid"
rabbitmq 13578  0.0  0.0   4440   624 ?        Ss   00:42   0:00 sh -c
/usr/lib/rabbitmq/bin/rabbitmqctl  "stop" "/var/run/rabbitmq/pid"
rabbitmq 13579  0.1  0.1 437204 14676 ?        Sl   00:42   0:01
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -- -root /usr/lib/erlang -progname
erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-hidden -sname rabbitmqctl13579 -boot start_clean -s rabbit_control_main
-nodename rabbit at node2 -extra stop /var/run/rabbitmq/pid
rabbitmq 13611  0.0  0.0   7460   432 ?        Ss   00:42   0:00
inet_gethost 4
rabbitmq 13612  0.0  0.0   9560   636 ?        S    00:42   0:00
inet_gethost 4
root     14958  0.0  0.0  11196   948 pts/0    S+   01:00   0:00 grep
--color=auto rabbit

NOTE that, we have the "K20rabbitmq-server stop"  running, we have the
rabbitmq processes too, but currently neither rabbitmqctl nor management
plugin is accessible, and the reboot process hung.

This was the very reason about month ago, we first killed the rabbitmq
processes. After that we thought "maybe we corrupted mnesia with the kill,
and we don't know why it hang for the first time".

Unfortunately as the computer is in a "going down to reboot" state, my only
shell is the logged in one (one can't get a new connection while the system
is going down).... but by starting sshd manually I can have this node
powered up for a while.

Log files:

the sasl log on node2 is empty (yeeah)
shutdown_err and startup_err is empty (whoa)
startup_log is normal
shutdown_log
"Stopping and halting node rabbit at node2 ...
"
and it has the timestamp of the reboot command

node2.log:

"=INFO REPORT==== 9-Apr-2014::00:42:31 ===
Stopping RabbitMQ

=INFO REPORT==== 9-Apr-2014::00:42:31 ===
stopped TCP Listener on [::]:5672

....a few closed connections....

=INFO REPORT==== 9-Apr-2014::00:42:32 ===
Halting Erlang VM
"

 so it looks like, that rabbitmq really exited 25 minutes ago, but the stop
script won't exit, as the processes still running.

 Could the problem be the inet_gethost process? Unfortunately I've seen dns
problems in our networks, so that could be a problem, but that should not
hang the shutdown.

 The running rabbit processes and their open files are here:

 http://pastebin.com/mLKf5mEu

root at node2:~# pstree
init─┬─dbus-daemon
...
     ├─epmd
     ├─rabbitmq-server───beam.smp───38*[{beam.smp}]

 ├─rc───K20rabbitmq-ser───rabbitmqctl───su───sh───beam.smp─┬─inet_gethost───inet_gethost
     │
└─18*[{beam.smp}]
...

So, I'm looking to find the hangup process... straceing and killing them.
The first try is the gethosts:

root at node2:~# strace -p 13612
Process 13612 attached
read(0,

....

^CProcess 13612 detached
 <detached ...>
root at node2:~# strace -p 13611
Process 13611 attached
select(4, [0 3], NULL, NULL, NULL
....
^CProcess 13611 detached

 <detached ...>
root at node2:~# kill 13612
root at node2:~# kill 13611

weee, after this second kill we've got the inet_gethost respawned, so the
beam.smp (pid 13579) still works (but it should have already quit :) )

We know that some from these are waiting:

├─rc(13415)───K20rabbitmq-ser(13513)───rabbitmqctl(13568)───su(13577)───sh(13578)───beam.smp(13579)─┬─inet_gethost(16346)
        │
                                ├─{beam.smp}(13591)
        │
                                ├─{beam.smp}(13592)
        │
                                ├─{beam.smp}(13593)
...
        │
                                └─{beam.smp}(13608)

 Any ideas? How can I debug this?
 Specific questions?

 I will leave this system in this state for now, to allow us to diagnose
the situation. Unfortunately I don't know erlang, but I can try anything if
I can help solve these hangups. As I've said this is the same behaviour as
3.1.3, 3.2.2, 3.2.3 and so on worked, they sometimes just hung, then we
killed them and had that bad feeling in our hearts, that we may broke the
db. I won't kill this time. :)

 If you can "remote control" me, I may even try to erlang debug the
processes, to guess what they're waiting for.

 Thank you for your help, I hope these hangups could get eliminated.

 Have a nice day,

 Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20140409/16298824/attachment.html>