[rabbitmq-discuss] RabbitMQ 2.0 hanging

Dave Greggory davegreggory at yahoo.com
Tue Sep 14 16:11:45 BST 2010


So it happened again this morning. 

rabbitmqctl status, list_connections and list_exchanges worked, but list_queues 
and list_channels hung.
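
A symptom like this, where some rabbitmqctl subcommands return and others
block, can be recorded by running each subcommand under a timeout. The
following is a minimal Python sketch and not part of the original report. The
10-second timeout and the helper names probe and probe_rabbitmqctl are
assumptions; the subcommands are only the ones named above.

```python
import subprocess

# Subcommands mentioned in this thread. The timeout value is an arbitrary
# assumption; pick one that suits the size of the broker.
SUBCOMMANDS = [
    "status",
    "list_connections",
    "list_exchanges",
    "list_queues",
    "list_channels",
]


def probe(argv, timeout_s=10.0):
    """Run argv and classify the outcome as 'ok', 'failed', 'missing' or 'hung'."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


def probe_rabbitmqctl(timeout_s=10.0, binary="rabbitmqctl"):
    """Probe each subcommand in turn and return {subcommand: outcome}."""
    return {cmd: probe([binary, cmd], timeout_s) for cmd in SUBCOMMANDS}


if __name__ == "__main__":
    for cmd, outcome in probe_rabbitmqctl().items():
        print(f"{cmd}: {outcome}")
```

Running this from cron on each node would give a timestamped record of which
subcommands stop responding first, which may help when the broker log itself
is empty.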

This time there were no errors in the log, unlike last time. That has been
quite common: when the hang happens, there is nothing in the logs, which is
why I didn't report it any earlier. Very mysterious.

I have attached the output of status, list_connections, dmesg, and lsof from 
both rabbitmq nodes in the cluster.




----- Original Message ----
From: Dave Greggory <davegreggory at yahoo.com>
To: Matthew Sackman <matthew at rabbitmq.com>; rabbitmq-discuss at lists.rabbitmq.com
Sent: Mon, September 13, 2010 11:48:44 AM
Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging

Wow... ok.

I'll grab lsof / dmesg / syslog output next time this happens.

Thanks for looking into it. Much appreciated.



----- Original Message ----
From: Matthew Sackman <matthew at rabbitmq.com>
To: rabbitmq-discuss at lists.rabbitmq.com
Sent: Mon, September 13, 2010 10:53:24 AM
Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging

Hi Dave,

Sorry for the delay in getting back to you.

Your node1 log had this in it:

=ERROR REPORT==== 8-Sep-2010::09:45:43 ===
** Generic server <0.29.0> terminating
** Last message in was {'EXIT',<0.30.0>,eio}
** When Server state == {state,user_sup,undefined,<0.30.0>,
                               {<0.29.0>,user_sup}}
** Reason for termination ==
** eio

This is utterly bizarre - we've never seen it before and it was
certainly enough to take down node1, or at least hang it.

node2 log has:

=ERROR REPORT==== 8-Sep-2010::09:41:38 ===
** Generic server delegate_process_0 terminating
** Last message in was {'$gen_cast',{thunk,#Fun<delegate.4.123807736>}}
** When Server state == no_state
** Reason for termination ==
** {noproc,{gen_server2,call,
                        [{delegate_process_1,'rabbit@ent-jms-qa-1'},
                         {thunk,#Fun<delegate.5.131821234>},
                         infinity]}}

This is basically node2 finding that node1 has gone down. This suggests
(as does your timeline) that node1 actually failed some time earlier,
that the immediate error was not logged, and that only at some later point
did a very generic "eio" come out of it - literally an error in some form
of I/O operation.

Now the eio comes out of process <0.30.0> which is a process which is
started very early on in the Erlang VM boot process. I can't quite tell
what the user_sup process is meant to be doing - it's so far buried that
there's no documentation for it. It's quite possible you've found a bug
in Erlang itself. Even having googled around for a while, I still can't
really find out what "user" is for - the best I can find is:
"user is a server which responds to all the messages defined in the I/O
interface. The code in user.erl can be used as a model for building
alternative I/O servers." so that's nice and clear. Anyway, my guess is
some error came out of said I/O server, took out user and user_sup which
was then logged. But as to what the fault actually was, I'm afraid I
have no idea.

When this next happens, any chance you could check things like the number
of open file descriptors, and see whether there are any relevant kernel log
messages? Sorry I can't be more helpful - it's just not something we've
ever come across before.
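
The two checks requested above could be scripted roughly as follows. This is
a minimal Python sketch under stated assumptions, not an official RabbitMQ or
Erlang tool: it assumes a Linux host with /proc mounted, the helper names
count_open_fds and suspicious_kernel_lines are made up for illustration, and
the regular expression is only a guess at which kernel messages might matter.

```python
import os
import re

# Assumed set of interesting kernel messages; extend it for your own system.
KERNEL_PATTERN = re.compile(
    r"i/o error|too many open files|out of memory|oom-killer",
    re.IGNORECASE,
)


def count_open_fds(pid):
    """Count the file descriptors a process has open (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def suspicious_kernel_lines(dmesg_text):
    """Return the lines of saved dmesg output that match KERNEL_PATTERN."""
    return [
        line for line in dmesg_text.splitlines() if KERNEL_PATTERN.search(line)
    ]
```

For example, count_open_fds could be given the pid of the Erlang VM running
the broker (reading another user's /proc entry usually needs root), and
suspicious_kernel_lines could be fed the contents of a saved dmesg capture.
Comparing the descriptor count against the limit shown by "ulimit -n" for the
broker's user would show whether the node is close to running out.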

Matthew
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss at lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss



      
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node2-status.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0008.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node1-connections.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0009.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node1-dmesg.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0010.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node1-lsof.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0011.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node1-status.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0012.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node2-connections.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0013.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node2-dmesg.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0014.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: node2-lsof.txt
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20100914/60fdefbb/attachment-0015.txt>

