[rabbitmq-discuss] RabbitMQ 2.0 hanging

Matthew Sackman matthew at rabbitmq.com
Mon Sep 13 15:53:24 BST 2010


Hi Dave,

Sorry for the delay in getting back to you.

Your node1 log had this in it:

=ERROR REPORT==== 8-Sep-2010::09:45:43 ===
** Generic server <0.29.0> terminating
** Last message in was {'EXIT',<0.30.0>,eio}
** When Server state == {state,user_sup,undefined,<0.30.0>,
                               {<0.29.0>,user_sup}}
** Reason for termination ==
** eio

This is utterly bizarre - we've never seen it before, and it was
certainly enough to take down node1, or at least hang it.

node2 log has:

=ERROR REPORT==== 8-Sep-2010::09:41:38 ===
** Generic server delegate_process_0 terminating
** Last message in was {'$gen_cast',{thunk,#Fun<delegate.4.123807736>}}
** When Server state == no_state
** Reason for termination ==
** {noproc,{gen_server2,call,
                        [{delegate_process_1,'rabbit@ent-jms-qa-1'},
                         {thunk,#Fun<delegate.5.131821234>},
                         infinity]}}

This is basically node2 finding that node1 has gone down. This suggests
(as does your timeline) that node1 actually failed some time earlier,
but that the immediate error was not logged, and only at some later
point did a very generic "eio" come out of it - literally an error in
some form of I/O operation.
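
For reference, the reason atoms Erlang reports for file and I/O failures
are generally lowercased POSIX errno names, so eio corresponds to EIO.
A quick sketch in Python (nothing to do with the broker itself; the
helper name describe_errno_atom is made up for illustration) to look up
what the local OS says about such a code:

```python
import errno
import os


def describe_errno_atom(atom):
    """Map an Erlang-style POSIX atom such as 'eio' to (errno number, OS message).

    Returns None if the name is not a known errno constant on this platform.
    """
    code = getattr(errno, atom.upper(), None)
    if not isinstance(code, int):
        return None
    return code, os.strerror(code)
```

Calling describe_errno_atom("eio") returns the numeric EIO value together
with the platform's generic I/O error message, which says nothing about
which device or file was involved.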

Now the eio comes out of process <0.30.0>, which is started very early
in the Erlang VM boot process. I can't quite tell what the user_sup
process is meant to be doing - it's buried so deep that there's no
documentation for it. It's quite possible you've found a bug in Erlang
itself. Even having googled around for a while, I still can't really
find out what "user" is for - the best I can find is:
"user is a server which responds to all the messages defined in the I/O
interface. The code in user.erl can be used as a model for building
alternative I/O servers." so that's nice and clear. Anyway, my guess is
that some error came out of said I/O server and took out user and
user_sup, which was then logged. But as to what the fault actually was,
I'm afraid I have no idea.

When this next happens, any chance you could check things like the
number of open file descriptors, and see whether there are any relevant
kernel log messages? Sorry I can't be more helpful - it's just not
something we've ever come across before.
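
If it helps, the file descriptor check could be done with something like
the rough sketch below. It assumes a Linux box with /proc mounted, and
the function names are just illustrative; pass it the OS pid of the
Erlang VM (the beam or beam.smp process) and run it as the same user or
as root.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors a process holds, via /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit(pid):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits, or None."""
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units
                soft = line.split()[3]
                return int(soft) if soft.isdigit() else None
    return None
```

A count from count_open_fds(pid) that sits close to soft_fd_limit(pid)
would point towards descriptor exhaustion; for the kernel side, the
output of dmesg around the time of the hang is the thing to look at.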

Matthew

