[rabbitmq-discuss] Upgrade fail

Wed Apr 9 12:43:05 BST 2014

On 09/04/14 00:48, Peter Kopias wrote:
> Just starting and stopping rabbitmq-server leaves one running epmd
> process in memory! I've noticed it before, but it looks like it's intended.

Yes. The epmd process staying around is not related to anything else
you're seeing; that's how epmd is supposed to work.

epmd is a very simple daemon which just maps names to port numbers for 
Erlang (beam) processes. It doesn't contain any other state.
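
As a rough illustration, `epmd -names` lists what is currently registered with the local epmd. The sketch below assumes the usual "name <node> at port <port>" output lines; the helper name `epmd_has_node` is made up for this example.

```shell
# Hypothetical helper: read `epmd -names`-style output on stdin and
# succeed if the given short node name (e.g. "rabbit") is registered.
epmd_has_node() {
  grep -q "^name $1 at port [0-9][0-9]*\$"
}

# Usage (on the broker host):
#   epmd -names | epmd_has_node rabbit && echo "rabbit is registered"
```

If the node name is missing from that list while a beam process is still alive, the node has unregistered (or never registered), which is what rabbitmqctl reports as "seems not to be running at all".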

>   Node2 and node3 always exited with the message that they are not the
> one stopped the last, and if I removed the locks they tried to connect
> to each other - while repairing the mnesia db, and as the network
> connect failed (as they were not accepting connections probably because
> the db repair was on), they quit, leaving the repair in half. So the
> nodes were not able to communicate because of the db repair (this is a
> theory only), and as the timeout arrived they quit before the repair
> would finish. (It would be nice to have a message like "starting repair"
> and "repair finished", as the log message currently is not clear about
> the state of the repair, I'm not sure that if the repair ends I'll get a
> logitem.)

If you are talking about the upgrades, then

"mnesia upgrades: N to apply"

marks the start, and

"mnesia upgrades: All upgrades applied successfully"

marks the end.
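
A minimal sketch of checking for those two markers in a log file follows. The function name `upgrade_state` is hypothetical, and the log location is whatever your installation uses; only the two message texts above are relied on.

```shell
# Hypothetical helper: look at the last "mnesia upgrades:" line in a
# log file and report how far the upgrade got.
upgrade_state() {
  # "|| true" keeps the assignment from failing when grep finds nothing.
  last=$(grep 'mnesia upgrades: ' "$1" | tail -n 1 || true)
  case "$last" in
    *'All upgrades applied successfully'*) echo "finished" ;;
    *' to apply'*) echo "in progress or interrupted" ;;
    *) echo "no upgrade markers found" ;;
  esac
}

# Usage: upgrade_state /path/to/your/rabbit.log
```

A start marker with no matching end marker after it suggests the node stopped before the upgrade completed.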

> node2: sudo reboot, and it's just waiting forever (without the usual
> connection closed by foreign host)
>
> root@node2:~# rabbitmqctl status
> Status of node rabbit@node2 ...
> Error: unable to connect to node rabbit@node2: nodedown
>
> DIAGNOSTICS
> ===========
>
> attempted to contact: [rabbit@node2]
>
> rabbit@node2:
>    * rabbit seems not to be running at all
>    * other nodes on node2: [rabbitmqctl13579]

<snip>

> NOTE that, we have the "K20rabbitmq-server stop"  running, we have the
> rabbitmq processes too, but currently neither rabbitmqctl nor management
> plugin is accessible, and the reboot process hung.

So this is very weird. The beam process is still around, but everything 
in the shutdown appears to have succeeded.

By the time "Halting Erlang VM" appears in the logs, RabbitMQ is
completely stopped; that log message is literally the last thing we do
before telling the VM to stop.

>   Could the problem be the inet_gethost process? Unfortunately I've seen
> dns problems in our networks, so that could be a problem, but that
> should not hang the shutdown.

No, it really shouldn't.

The beam process has already unregistered from epmd (hence "rabbit seems 
not to be running at all" in the status command above).

>   The running rabbit processes and their open files are here:
>
> http://pastebin.com/mLKf5mEu

And it seems to have few files open, and to have closed all network sockets.

>   Any ideas? How can I debug this?
>   Specific questions?

I am not full of ideas here; this looks like more of an issue with
Erlang than RabbitMQ. But some things to look into:

The beam process (2236 in your case) is the interesting one; it's the 
thing that should have shut down but hasn't; everything else is pretty 
much as expected. Unfortunately, although it's an Erlang process, it has
stopped listening for Erlang distribution messages, so debugging it is
likely to be hard. Having said that, strace might give some clue as to
what it's doing, and the small bright side is that so much within that 
process has already shut down that *anything* it is still doing is a 
good candidate for what's making it stuck.

So what does strace say for it?
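
Something along these lines should do, as a sketch only. The wrapper name `trace_pid` and the default output path are made up for illustration; the strace flags used are -f (follow threads and children), -tt (timestamps), -p (attach to a pid) and -o (write to a file).

```shell
# Hypothetical wrapper: attach strace to a running process and write
# the trace to a file. Needs root or the same user as the target.
trace_pid() {
  pid="$1"
  out="${2:-/tmp/beam-$pid.strace}"
  strace -f -tt -p "$pid" -o "$out"
}

# Usage (pid taken from the process listing): trace_pid 2236
# Interrupt with Ctrl-C after a short while, then read the output file.
```

If the trace shows the same system call repeating, or a single call that never returns, that is the most likely place it is stuck.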

>   Thank you for your help, I hope these hangups could get eliminated.

I would hope so. We don't have much to go on though.

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal