[rabbitmq-discuss] Upgrade fail

Peter Kopias kopias.peter at gmail.com
Tue Apr 15 00:23:32 BST 2014


Hi!

 I've just noticed I had this email unsent, waiting in draft for a few days.

 So this is just for closing down the thread with "feedback".

 It seems that Erlang sometimes hangs, waiting on an empty list of
file descriptors with no timeout... it's probably fixed in version 17.

 Thank you for all your help,

 Peter


> Yes. The epmd process staying around is not related to anything else
> you're seeing, that's how epmd is supposed to work.
>
> epmd is a very simple daemon which just maps names to port numbers for
> Erlang (beam) processes. It doesn't contain any other state.

Ok.

 BTW IMO the "unix way" would be that if I start and stop something, the
machine should be in the state it was before. No processes left. :) But I
can live with this. :) We all like our systems well behaved and
deterministic. :)

>>    Node2 and node3 always exited with the message that they are not the
>> one stopped the last, and if I removed the locks they tried to connect
>> to each other - while repairing the mnesia db, and as the network
>> connect failed (as they were not accepting connections probably because
>> the db repair was on), they quit, leaving the repair in half. So the
>> nodes were not able to communicate because of the db repair (this is a
>> theory only), and as the timeout arrived they quit before the repair
>> would finish. (It would be nice to have a message like "starting repair"
>> and "repair finished", as the log message currently is not clear about
>> the state of the repair, I'm not sure that if the repair ends I'll get a
>> logitem.)
>>
>
> If you are talking about the upgrades, then
>
No. No upgrades. It was telling me that some db files were damaged and
needed to be repaired. I'll copy the messages here, but they don't contain
much information about what the problem is.

...


>>  NOTE that, we have the "K20rabbitmq-server stop"  running, we have the
>> rabbitmq processes too, but currently neither rabbitmqctl nor management
>> plugin is accessible, and the reboot process hung.
>>
>
> So this is very weird. The beam process is still around, but everything in
> the shutdown appears to have succeeded.
>
> By the time "Halting Erlang VM" appears in the logs RabbitMQ is completely
> stopped, that log message is literally the last thing we do before telling
> the VM to stop.
>
>
>> No, it really shouldn't.
>
> The beam process has already unregistered from epmd (hence "rabbit seems
> not to be running at all" in the status command above).
>
>
>>    The running rabbit processes and their open files are here:
>>
>> http://pastebin.com/mLKf5mEu
>>
>
> And it seems to have few files open, and to have closed all network
> sockets.
>
>
>>    Any ideas? How can I debug this?
>>   Specific questions?
>>
>
> I am not full of ideas here, this looks like more of an issue with Erlang
> than RabbitMQ. But some things to look into:
>
> The beam process (2236 in your case) is the interesting one; it's the
> thing that should have shut down but hasn't; everything else is pretty much
> as expected. Unfortunately although it's an Erlang process it has stopped
> listening for Erlang distribution messages so debugging it is likely to be
> hard. Having said that, strace might give some clue as to what it's doing,
> and the small bright side is that so much within that process has already
> shut down that *anything* it is still doing is a good candidate for what's
> making it stuck.
>
> So what does strace say for it?
>
>
Cutting to the chase: this is probably the Erlang bug that's been fixed
in 17, as you wrote in the next email.

# strace -p 2236
Process 2236 attached
select(0, NULL, NULL, NULL, NULL

 that means:
 - zero file descriptors to watch (nfds = 0)
 - empty read file descriptor set
 - empty write file descriptor set
 - empty exception file descriptor set
 - no timeout (NULL)

from select(2):

"If timeout is NULL (no timeout), select() can block indefinitely."

This is the cpu-friendly version of

while(1);

 :)
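
 For anyone who wants to see the same behaviour without Erlang, here is a
minimal sketch in Python. It is only an illustration, not what the Erlang VM
runs, and it assumes a Unix-like system. The helper name empty_select is made
up for this example. It calls select() with no file descriptors at all, but
with a finite timeout so that it actually returns; passing None instead
corresponds to the NULL timeout in the strace output above.

```python
import select
import time


def empty_select(timeout):
    """Call select() with three empty descriptor lists.

    timeout=None is the equivalent of the NULL timeout seen in strace and
    would block until a signal arrives; a number makes the call return
    after roughly that many seconds.
    """
    start = time.monotonic()
    ready = select.select([], [], [], timeout)
    elapsed = time.monotonic() - start
    return ready, elapsed


if __name__ == "__main__":
    # Finite timeout: nothing can ever become ready, so select() just sleeps.
    ready, elapsed = empty_select(0.2)
    print("ready lists:", ready)
    print("slept for about %.2f seconds" % elapsed)
    # empty_select(None) would not return on its own: with no descriptors
    # and no timeout there is nothing left that could wake the call up.
```

 With a timeout the call is just a sleep; without one, only a signal can
end it.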

 I'll try to get it unstuck by sending the process signals: SIGINT, SIGHUP,
SIGTERM, SIGKILL...

Let's try: SIGINT (2)

.....

result:

Connection to ... closed by remote host.

(Note: we were in the middle of a shutdown when the hang happened :D)

So the problem is a "wait on no file descriptors with no timeout", which MAY
be the same bug that Erlang 17 has fixed, but we don't know for sure. :)


If someone finds a hanging rabbit process after stopping, strace it and
check whether it's sitting in a 'select(0, NULL, NULL, NULL, NULL'. :)
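
If you have captured strace output from several machines, the check can be
scripted. The following is a rough sketch in Python; the pattern and the
function name is_empty_select are my own, written against the exact line
strace printed for me above. Other strace versions or platforms may print the
call differently, so treat the pattern as an assumption.

```python
import re

# Matches an strace line of the form seen in this thread:
#   select(0, NULL, NULL, NULL, NULL
# i.e. zero descriptors, three NULL sets and a NULL timeout.
EMPTY_SELECT = re.compile(
    r"^\s*select\(\s*0\s*,\s*NULL\s*,\s*NULL\s*,\s*NULL\s*,\s*NULL\b"
)


def is_empty_select(strace_line):
    """Return True if the line looks like a select() on nothing, forever."""
    return EMPTY_SELECT.match(strace_line) is not None


if __name__ == "__main__":
    for line in ("Process 2236 attached", "select(0, NULL, NULL, NULL, NULL"):
        print(repr(line), "->", is_empty_select(line))
```

A match only says the process is parked in that call; it does not prove it
is the same bug.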

Thanks Simon for your help again!

 Bye,

 Peter

>>    Thank you for your help, I hope these hangups could get eliminated.
>>
>
> I would hope so. We don't have much to go on though.
>
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, Pivotal
>