[rabbitmq-discuss] Upgrade fail
Peter Kopias
kopias.peter at gmail.com
Wed Apr 9 00:48:51 BST 2014
>
> I've tried to upgrade my cluster to 3.3.0.
>>
>
> <snip>
>
> As I've wrote before, I did a clean install, to drop the inherited
possibly damaged mnesia files, so the previous installation can not be
tested. But of course I've got all the logs.
>> This is quite weird. The diagnostics indicate that there actually is some
> process up, registered with epmd and listening on the Erlang distribution
> port for both node2 and node3.
I've seen this quite a few times.
Just starting and stopping rabbitmq-server leaves one running epmd process
in memory! I've noticed it before, but it looks like it's intended.
For example (after the fresh rabbit install):
petros at node3:~$ ps aux|grep rabbitmq
rabbitmq 1223 0.0 0.0 7496 320 ? S 00:41 0:00
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
rabbitmq 2228 0.0 0.0 4440 624 ? S 00:42 0:00 /bin/sh
/usr/sbin/rabbitmq-server
rabbitmq 2268 2.1 0.6 1002616 51564 ? Sl 00:42 0:13
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 --
-root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-s rabbit boot -sname rabbit at node3 -boot start_sasl -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit at node3.log"} -rabbit
sasl_error_logger {file,"/var/log/rabbitmq/rabbit at node3-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit
plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit at node3-plugins-expand"
-os_mon start_cpu_sup false -os_mon start_disksup false -os_mon
start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit at node3"
-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
rabbitmq 2368 0.0 0.0 7460 428 ? Ss 00:42 0:00
inet_gethost 4
rabbitmq 2369 0.0 0.0 13788 832 ? S 00:42 0:00
inet_gethost 4
petros 5381 0.0 0.0 11196 948 pts/0 S+ 00:52 0:00 grep
--color=auto rabbitmq
petros at node3:~$ sudo service rabbitmq-server stop
* Stopping message broker rabbitmq-server
[ OK ]
petros at node3:~$ ps aux|grep rabbitmq
rabbitmq 1223 0.0 0.0 7496 320 ? S 00:41 0:00
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
petros 5505 0.0 0.0 11196 944 pts/0 S+ 00:53 0:00 grep
--color=auto rabbitmq
petros at node3:~$
So after stopping, one erlang process still runs, who knows why. :)
petros at node3:~$ sudo lsof |grep epmd
epmd 1223 rabbitmq cwd DIR 8,1 4096
2 /
epmd 1223 rabbitmq rtd DIR 8,1 4096
2 /
epmd 1223 rabbitmq txt REG 8,1 51856
936647 /usr/lib/erlang/erts-5.10.2/bin/epmd
epmd 1223 rabbitmq mem REG 8,1 1853400
394057 /lib/x86_64-linux-gnu/libc-2.17.so
epmd 1223 rabbitmq mem REG 8,1 1063328
394030 /lib/x86_64-linux-gnu/libm-2.17.so
epmd 1223 rabbitmq mem REG 8,1 149312
394033 /lib/x86_64-linux-gnu/ld-2.17.so
epmd 1223 rabbitmq 0r CHR 1,3 0t0
1029 /dev/null
epmd 1223 rabbitmq 1w CHR 1,3 0t0
1029 /dev/null
epmd 1223 rabbitmq 2w CHR 1,3 0t0
1029 /dev/null
epmd 1223 rabbitmq 3u IPv4 10155 0t0
TCP *:epmd (LISTEN)
and it's listening too...
>
> WHY is it waiting for node2, node3, if node1 was the last to stop, and
>> it should come online in itself without them?
>>
>
> Very good question.
>
> I was able to reproduce this precise error message by doing the equivalent
> of:
>
> (stop node1)
> rabbitmqctl -n node2 stop_app
> rabbitmqctl -n node3 stop_app
> (start node1)
>
> i./e. leaving the Erlang VM running but stopping RabbitMQ (and Mnesia) on
> node2 and node3.
>
> Is there any possibility at all you could have done something like this?
> Is there a beam.smp process running on those nodes?
>
I don't run rabbitmqctl stop_app commands by hand, I've got these things
(creating a cluster, changing nodetype) scripted, so no, that's not the
case. Even if I run something that needs stop_app, there was a start_app in
a second.
I am especially puzzled since the nodes have registered 25672 as a
> distribution port - prior to 3.3.0 the distribution port would be chosen at
> random, or you could configure it. So the presence of 25672 strongly
> indicates that those nodes have been upgraded to 3.3.0 and are running an
> Erlang VM - with RabbitMQ stopped?
I think that the standard apt-get process is doing something like this:
service rabbitmq-server stop
upgrade
service rabbitmq-server start
but as we've already identified rabbitmq-server stop does leave an epmd
running, as it did with 3.1.3, 3.2.2, 3.2.3, 3.2.4, 3.3.0
so after upgradeing the epmd isn't fresh, it may have previous settings,
apps running.
>
> The cookie is the same, the hostnames are correct.
>>
>
> If I am correct about the diagnosis above then the diagnostics don't
> currently handle this situation well; they don't check for the case where
> Erlang is running but RabbitMQ isn't. This will get fixed in the next
> release.
Intensive logging would help too.
>
> I'm trying to start node2 and node3 to have at least one running but
>> they always fail, even if I copied back the previous disk state, and
>> upgraded again...
>>
>
> How do they fail? Because it sounds like something is running.
Node2 and node3 always exited with the message that they are not the one
stopped the last, and if I removed the locks they tried to connect to each
other - while repairing the mnesia db, and as the network connect failed
(as they were not accepting connections probably because the db repair was
on), they quit, leaving the repair in half. So the nodes were not able to
communicate because of the db repair (this is a theory only), and as the
timeout arrived they quit before the repair would finish. (It would be nice
to have a message like "starting repair" and "repair finished", as the log
message currently is not clear about the state of the repair, I'm not sure
that if the repair ends I'll get a logitem.)
==== cut here ====
At this very moment I have the following situation (after the clean 3.3.0
install on all nodes, and cluster rebuilt):
node1,node2,node3 is after an apt-get upgrade, and needs restarting
node1 restart done, fine.
node3 restart done, fine.
node2: sudo reboot, and it's just waiting forever (without the usual
conneciton closed by foreign host)
root at node2:~# rabbitmqctl status
Status of node rabbit at node2 ...
Error: unable to connect to node rabbit at node2: nodedown
DIAGNOSTICS
===========
attempted to contact: [rabbit at node2]
rabbit at node2:
* rabbit seems not to be running at all
* other nodes on node2: [rabbitmqctl13579]
current node details:
- node name: rabbitmqctl14147 at node2
- home dir: /var/lib/rabbitmq
- cookie hash: XSquoZCiCPIfZlTilwy1Dg==
....
root at node2:~# ps aux|grep rabbit
rabbitmq 1221 0.0 0.0 9416 2372 ? S Apr02 0:10
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
rabbitmq 2189 0.0 0.0 4440 624 ? S Apr02 0:00 /bin/sh
/usr/sbin/rabbitmq-server
rabbitmq 2236 2.3 0.6 995960 53324 ? Sl Apr02 211:08
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -W w -K true -A30 -P 1048576 --
-root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-s rabbit boot -sname rabbit at node2 -boot start_sasl -config
/etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit at node2.log"} -rabbit
sasl_error_logger {file,"/var/log/rabbitmq/rabbit at node2-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../plugins" -rabbit
plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit at node2-plugins-expand"
-os_mon start_cpu_sup false -os_mon start_disksup false -os_mon
start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit at node2"
-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
root 13513 0.0 0.0 4440 776 ? S 00:42 0:00 /bin/sh
/etc/rc6.d/K20rabbitmq-server stop
root 13568 0.0 0.0 4440 680 ? S 00:42 0:00 /bin/sh
/usr/sbin/rabbitmqctl stop /var/run/rabbitmq/pid
root 13577 0.0 0.0 53204 1652 ? S 00:42 0:00 su
rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "stop"
"/var/run/rabbitmq/pid"
rabbitmq 13578 0.0 0.0 4440 624 ? Ss 00:42 0:00 sh -c
/usr/lib/rabbitmq/bin/rabbitmqctl "stop" "/var/run/rabbitmq/pid"
rabbitmq 13579 0.1 0.1 437204 14676 ? Sl 00:42 0:01
/usr/lib/erlang/erts-5.10.2/bin/beam.smp -- -root /usr/lib/erlang -progname
erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.0/sbin/../ebin -noshell -noinput
-hidden -sname rabbitmqctl13579 -boot start_clean -s rabbit_control_main
-nodename rabbit at node2 -extra stop /var/run/rabbitmq/pid
rabbitmq 13611 0.0 0.0 7460 432 ? Ss 00:42 0:00
inet_gethost 4
rabbitmq 13612 0.0 0.0 9560 636 ? S 00:42 0:00
inet_gethost 4
root 14958 0.0 0.0 11196 948 pts/0 S+ 01:00 0:00 grep
--color=auto rabbit
NOTE that, we have the "K20rabbitmq-server stop" running, we have the
rabbitmq processes too, but currently neither rabbitmqctl nor management
plugin is accessible, and the reboot process hung.
This was the very reason about month ago, we first killed the rabbitmq
processes. After that we thought "maybe we corrupted mnesia with the kill,
and we don't know why it hang for the first time".
Unfortunately as the computer is in a "going down to reboot" state, my only
shell is the logged in one (one can't get a new connection while the system
is going down).... but by starting sshd manually I can have this node
powered up for a while.
Log files:
the sasl log on node2 is empty (yeeah)
shutdown_err and startup_err is empty (whoa)
startup_log is normal
shutdown_log
"Stopping and halting node rabbit at node2 ...
"
and it has the timestamp of the reboot command
node2.log:
"=INFO REPORT==== 9-Apr-2014::00:42:31 ===
Stopping RabbitMQ
=INFO REPORT==== 9-Apr-2014::00:42:31 ===
stopped TCP Listener on [::]:5672
....a few closed connections....
=INFO REPORT==== 9-Apr-2014::00:42:32 ===
Halting Erlang VM
"
so it looks like, that rabbitmq really exited 25 minutes ago, but the stop
script won't exit, as the processes still running.
Could the problem be the inet_gethost process? Unfortunately I've seen dns
problems in our networks, so that could be a problem, but that should not
hang the shutdown.
The running rabbit processes and their open files are here:
http://pastebin.com/mLKf5mEu
root at node2:~# pstree
init─┬─dbus-daemon
...
├─epmd
├─rabbitmq-server───beam.smp───38*[{beam.smp}]
├─rc───K20rabbitmq-ser───rabbitmqctl───su───sh───beam.smp─┬─inet_gethost───inet_gethost
│
└─18*[{beam.smp}]
...
So, I'm looking to find the hangup process... straceing and killing them.
The first try is the gethosts:
root at node2:~# strace -p 13612
Process 13612 attached
read(0,
....
^CProcess 13612 detached
<detached ...>
root at node2:~# strace -p 13611
Process 13611 attached
select(4, [0 3], NULL, NULL, NULL
....
^CProcess 13611 detached
<detached ...>
root at node2:~# kill 13612
root at node2:~# kill 13611
weee, after this second kill we've got the inet_gethost respawned, so the
beam.smp (pid 13579) still works (but it should have already quit :) )
We know that some from these are waiting:
├─rc(13415)───K20rabbitmq-ser(13513)───rabbitmqctl(13568)───su(13577)───sh(13578)───beam.smp(13579)─┬─inet_gethost(16346)
│
├─{beam.smp}(13591)
│
├─{beam.smp}(13592)
│
├─{beam.smp}(13593)
...
│
└─{beam.smp}(13608)
Any ideas? How can I debug this?
Specific questions?
I will leave this system in this state for now, to allow us to diagnose
the situation. Unfortunately I don't know erlang, but I can try anything if
I can help solve these hangups. As I've said this is the same behaviour as
3.1.3, 3.2.2, 3.2.3 and so on worked, they sometimes just hung, then we
killed them and had that bad feeling in our hearts, that we may broke the
db. I won't kill this time. :)
If you can "remote control" me, I may even try to erlang debug the
processes, to guess what they're waiting for.
Thank you for your help, I hope these hangups could get eliminated.
Have a nice day,
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20140409/16298824/attachment.html>
More information about the rabbitmq-discuss
mailing list