[rabbitmq-discuss] RabbitMQ 2.0 hanging

Noah Fontes nfontes at cynigram.com
Tue Sep 14 20:13:38 BST 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Dave,

On 09/14/2010 10:11 AM, Dave Greggory wrote:
> So it happened again this morning. 
> 
> rabbitmqctl status, list_connections and list_exchanges worked, but list_queues 
> and list_channels hung.
> 
> This time there were no errors in the log, unlike the last time. This has been 
> quite common, that when it happens there's nothing in the logs. That's why I 
> didn't report it any earlier. Very mysterious.

This is quite interesting. We observed this behavior as well --
list_queues and list_channels hanging. This was also reflected in
consumers/publishers: we could publish messages fine, but trying to read
from a queue (or even delete one) would hang, usually indefinitely.

We also noted that if we repeatedly attempted to run list_queues the RPC
call would eventually succeed -- maybe once out of 10 or 15 runs. With
the exception of certain queues building up with messages (as I
mentioned above) everything looked fine.
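
To put a number on that success rate, something like the following
could be used. It is only a sketch: it assumes rabbitmqctl is on the
PATH and runnable by the current user, and the names probe and
summarize, the 30-second timeout, and the 15 attempts are illustrative
choices, not values taken from this thread.

```python
import subprocess
import time


def probe(cmd, attempts=15, timeout_s=30.0, pause_s=0.0):
    """Run cmd repeatedly and return a list of (outcome, seconds).

    outcome is "ok" (exit code 0), "failed" (non-zero exit code),
    or "hung" (the command was killed after timeout_s seconds).
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # subprocess.run kills the child itself when the timeout expires.
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            outcome = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            outcome = "hung"
        results.append((outcome, time.monotonic() - start))
        if pause_s:
            time.sleep(pause_s)
    return results


def summarize(results):
    """Count how many attempts ended in each outcome."""
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for outcome, _elapsed in results:
        counts[outcome] += 1
    return counts
```

Calling summarize(probe(["rabbitmqctl", "list_queues"])) returns the
number of ok, failed, and hung attempts. Mostly "hung" with an
occasional "ok" would match the behavior described above.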

It started when we switched from 1.7.x to 1.8.x (which we're still
running for the moment). It only seems to happen when nodes are
clustered; I've never seen the problem on a non-clustered instance.

I'll try to grab some more information when/if it happens again for us.

I also haven't seen the issue occur in probably about 3 weeks now. It's
very sporadic, although I think I've seen it happen more than once in a
day (and then not again for a long time).

> I have attached the output of status, list_connections, dmesg, and lsof from 
> both rabbitmq nodes in the cluster.

FWIW, here's the minimal information I can offer now:

- We have a four-node cluster of two disk nodes and two memory nodes
across two physical servers.
- We're running RabbitMQ 1.8.1 with no additional plugins:
{rabbit,"RabbitMQ","1.8.1"},
{mnesia,"MNESIA  CXC 138 12","4.4.13"},
{os_mon,"CPO  CXC 138 46","2.2.5"},
{sasl,"SASL  CXC 138 11","2.1.9"},
{stdlib,"ERTS  CXC 138 10","1.16.5"},
{kernel,"ERTS  CXC 138 10","2.13.5"}

This is Erlang R13B04 on SuSE Linux.

Hopefully this can shed a *little* more light on the problem. Sorry I
can't offer more details at the moment.

Regards,

Noah

> ----- Original Message ----
> From: Dave Greggory <davegreggory at yahoo.com>
> To: Matthew Sackman <matthew at rabbitmq.com>; rabbitmq-discuss at lists.rabbitmq.com
> Sent: Mon, September 13, 2010 11:48:44 AM
> Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging
> 
> Wow... ok.
> 
> I'll grab lsof / dmesg / syslog output next time this happens.
> 
> Thanks for looking into it. Much appreciated.
> 
> 
> 
> ----- Original Message ----
> From: Matthew Sackman <matthew at rabbitmq.com>
> To: rabbitmq-discuss at lists.rabbitmq.com
> Sent: Mon, September 13, 2010 10:53:24 AM
> Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging
> 
> Hi Dave,
> 
> Sorry for the delay in getting back to you.
> 
> Your node1 log had this in it:
> 
> =ERROR REPORT==== 8-Sep-2010::09:45:43 ===
> ** Generic server <0.29.0> terminating
> ** Last message in was {'EXIT',<0.30.0>,eio}
> ** When Server state == {state,user_sup,undefined,<0.30.0>,
>                                {<0.29.0>,user_sup}}
> ** Reason for termination ==
> ** eio
> 
> This is utterly bizarre - we've never seen it before and it was
> certainly enough to take down the node1 or at least hang it.
> 
> node2 log has:
> 
> =ERROR REPORT==== 8-Sep-2010::09:41:38 ===
> ** Generic server delegate_process_0 terminating
> ** Last message in was {'$gen_cast',{thunk,#Fun<delegate.4.123807736>}}
> ** When Server state == no_state
> ** Reason for termination ==
> ** {noproc,{gen_server2,call,
>                         [{delegate_process_1,'rabbit at ent-jms-qa-1'},
>                          {thunk,#Fun<delegate.5.131821234>},
>                          infinity]}}
> 
> This is basically node2 finding that node1 has gone down. This suggests
> (as does your timeline) that node1 actually failed some time previously
> but that the immediate error was not logged and only at some later point
> did a very generic "eio" come out of it - literally error in some form
> of IO operation.
> 
> Now the eio comes out of process <0.30.0> which is a process which is
> started very early on in the Erlang VM boot process. I can't quite tell
> what the user_sup process is meant to be doing - it's so far buried that
> there's no documentation for it. It's quite possible you've found a bug
> in Erlang itself. Even having googled around for a while, I still can't
> really find out what "user" is for - the best I can find is:
> "user is a server which responds to all the messages defined in the I/O
> interface. The code in user.erl can be used as a model for building
> alternative I/O servers." so that's nice and clear. Anyway, my guess is
> some error came out of said I/O server, took out user and user_sup which
> was then logged. But as to what the fault actually was, I'm afraid I
> have no idea.
> 
> When this next happens, any chance you could check things like number of
> open file descriptors, see if there's any kernel log messages relevant
> etc? Sorry I can't be more helpful - it's just not something we've ever
> come across before.
> 
> Matthew
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iEYEARECAAYFAkyPyWEACgkQhitK+HuUQJRLpwCgnYY/YF8xTUW8xowocWKKPzbJ
BzUAn1aRtruRAgp/23v4mZB1JJXrBIaE
=CEzP
-----END PGP SIGNATURE-----

