[rabbitmq-discuss] RabbitMQ 2.0 hanging

Cal Leeming [Simplicity Media Ltd] cal.leeming at simplicitymedialtd.co.uk
Tue Sep 14 20:15:37 BST 2010


Just a quick note: I have also noticed this behaviour on both 1.8 and
2.0 when used with Celery.

I don't know why it happens; the error log shows nothing has gone wrong.

In the end I had to abandon RabbitMQ because of this :/

On Tue, Sep 14, 2010 at 8:13 PM, Noah Fontes <nfontes at cynigram.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello Dave,
>
> On 09/14/2010 10:11 AM, Dave Greggory wrote:
> > So it happened again this morning.
> >
> > rabbitmqctl status, list_connections and list_exchanges worked, but
> > list_queues and list_channels hung.
> >
> > This time there were no errors in the log, unlike the last time. This has
> > been quite common, that when it happens there's nothing in the logs. That's
> > why I didn't report it any earlier. Very mysterious.
>
> This is quite interesting. We observed this behavior as well --
> list_queues and list_channels hanging. This was also reflected in
> consumers/publishers: we could publish messages fine, but trying to read
> from a queue (or even delete one) would hang, usually indefinitely.
>
> We also noted that if we repeatedly attempted to run list_queues the RPC
> call would eventually succeed -- maybe once out of 10 or 15 runs. With
> the exception of certain queues building up with messages (as I
> mentioned above) everything looked fine.
>
> It started when we switched from 1.7.x to 1.8.x (which we're still
> running for the moment). It only seems to happen when nodes are
> clustered; I've never seen the problem on a non-clustered instance.
>
> I'll try to grab some more information when/if it happens again for us.
>
> I also haven't seen the issue occur in probably about 3 weeks now. It's
> very sporadic, although I think I've seen it happen more than once in a
> day (and then not again for a long time).
>
> > I have attached the output of status, list_connections, dmesg, and lsof
> > from both rabbitmq nodes in the cluster.
>
> FWIW, here's the minimal information I can offer now:
>
> - We have a four-node cluster of two disk nodes and two memory nodes
>   across two physical servers.
> - We're running RabbitMQ 1.8.1 with no additional plugins:
> {rabbit,"RabbitMQ","1.8.1"},
> {mnesia,"MNESIA  CXC 138 12","4.4.13"},
> {os_mon,"CPO  CXC 138 46","2.2.5"},
> {sasl,"SASL  CXC 138 11","2.1.9"},
> {stdlib,"ERTS  CXC 138 10","1.16.5"},
> {kernel,"ERTS  CXC 138 10","2.13.5"}
>
> This is erlang R13B04 on SuSE Linux.
>
> Hopefully this can shed a *little* more light on the problem. Sorry I
> can't offer more details at the moment.
>
> Regards,
>
> Noah
>
> > ----- Original Message ----
> > From: Dave Greggory <davegreggory at yahoo.com>
> > To: Matthew Sackman <matthew at rabbitmq.com>;
> >     rabbitmq-discuss at lists.rabbitmq.com
> > Sent: Mon, September 13, 2010 11:48:44 AM
> > Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging
> >
> > Wow... ok.
> >
> > I'll grab lsof / dmesg / syslog output next time this happens.
> >
> > Thanks for looking into it. Much appreciated.
> >
> >
> >
> > ----- Original Message ----
> > From: Matthew Sackman <matthew at rabbitmq.com>
> > To: rabbitmq-discuss at lists.rabbitmq.com
> > Sent: Mon, September 13, 2010 10:53:24 AM
> > Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging
> >
> > Hi Dave,
> >
> > Sorry for the delay in getting back to you.
> >
> > Your node1 log had this in it:
> >
> > =ERROR REPORT==== 8-Sep-2010::09:45:43 ===
> > ** Generic server <0.29.0> terminating
> > ** Last message in was {'EXIT',<0.30.0>,eio}
> > ** When Server state == {state,user_sup,undefined,<0.30.0>,
> >                                {<0.29.0>,user_sup}}
> > ** Reason for termination ==
> > ** eio
> >
> > This is utterly bizarre - we've never seen it before and it was
> > certainly enough to take down node1 or at least hang it.
> >
> > node2 log has:
> >
> > =ERROR REPORT==== 8-Sep-2010::09:41:38 ===
> > ** Generic server delegate_process_0 terminating
> > ** Last message in was {'$gen_cast',{thunk,#Fun<delegate.4.123807736>}}
> > ** When Server state == no_state
> > ** Reason for termination ==
> > ** {noproc,{gen_server2,call,
> >                         [{delegate_process_1,'rabbit at ent-jms-qa-1'},
> >                          {thunk,#Fun<delegate.5.131821234>},
> >                          infinity]}}
> >
> > This is basically node2 finding that node1 has gone down. This suggests
> > (as does your timeline) that node1 actually failed some time previously
> > but that the immediate error was not logged and only at some later point
> > did a very generic "eio" come out of it - literally an error in some form
> > of I/O operation.
> >
> > Now the eio comes out of process <0.30.0> which is a process which is
> > started very early on in the Erlang VM boot process. I can't quite tell
> > what the user_sup process is meant to be doing - it's so far buried that
> > there's no documentation for it. It's quite possible you've found a bug
> > in Erlang itself. Even having googled around for a while, I still can't
> > really find out what "user" is for - the best I can find is:
> > "user is a server which responds to all the messages defined in the I/O
> > interface. The code in user.erl can be used as a model for building
> > alternative I/O servers." so that's nice and clear. Anyway, my guess is
> > some error came out of said I/O server, took out user and user_sup which
> > was then logged. But as to what the fault actually was, I'm afraid I
> > have no idea.
> >
> > When this next happens, any chance you could check things like number of
> > open file descriptors, see if there's any kernel log messages relevant
> > etc? Sorry I can't be more helpful - it's just not something we've ever
> > come across before.
> >
> > Matthew
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
>
> iEYEARECAAYFAkyPyWEACgkQhitK+HuUQJRLpwCgnYY/YF8xTUW8xowocWKKPzbJ
> BzUAn1aRtruRAgp/23v4mZB1JJXrBIaE
> =CEzP
> -----END PGP SIGNATURE-----
>
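
For anyone wanting to capture the diagnostics requested above (which
rabbitmqctl subcommands hang, how often a retry succeeds, and how many file
descriptors the broker holds), here is a minimal sketch in Python. It is only
an illustration under stated assumptions: the helper names probe,
success_rate and count_open_fds, the 10-second timeout, and the use of
/proc (Linux only) are all made up for this example. Only the rabbitmqctl
subcommands themselves come from the messages quoted above.

```python
import os
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd once; return 'ok', 'failed', 'hung' (timed out) or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"  # the command is not on PATH
    return "ok" if result.returncode == 0 else "failed"


def success_rate(cmd, attempts=15, timeout_s=10.0):
    """Repeat probe() and return (number of 'ok' results, all results)."""
    results = [probe(cmd, timeout_s) for _ in range(attempts)]
    return results.count("ok"), results


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (Linux only); None if unreadable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


if __name__ == "__main__":
    # Subcommands taken from the thread; the timeout is an arbitrary choice.
    for sub in ("status", "list_connections", "list_queues", "list_channels"):
        print(sub, probe(["rabbitmqctl", sub]))
```

Calling count_open_fds with the pid of the Erlang VM (found however you
normally find it) gives a rough descriptor count to compare against the
process limit, and success_rate(["rabbitmqctl", "list_queues"]) shows whether
the "once out of 10 or 15 runs" pattern described above is reproducible.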



-- 

Cal Leeming

Operational Security & Support Team

*Out of Hours: *+44 (07534) 971120 | *Support Tickets: *
support at simplicitymedialtd.co.uk
*Fax: *+44 (02476) 578987 | *Email: *cal.leeming at simplicitymedialtd.co.uk
*IM: *AIM / ICQ / MSN / Skype (available upon request)
Simplicity Media Ltd. All rights reserved.
Registered company number 7143564