Just a quick note: I have also seen this behaviour on both 1.8 and 2.0, when used with Celery.<div><br></div><div>I don't know why it happens; the error log shows nothing going wrong.</div><div><br>
</div><div>In the end I had to abandon RabbitMQ because of this :/<br><br><div class="gmail_quote">On Tue, Sep 14, 2010 at 8:13 PM, Noah Fontes <span dir="ltr"><<a href="mailto:nfontes@cynigram.com">nfontes@cynigram.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hello Dave,<br>
<div class="im"><br>
On 09/14/2010 10:11 AM, Dave Greggory wrote:<br>
> So it happened again this morning.<br>
><br>
> rabbitmqctl status, list_connections and list_exchanges worked, but list_queues<br>
> and list_channels hung.<br>
><br>
> This time there were no errors in the log, unlike the last time. This has been<br>
> quite common, that when it happens there's nothing in the logs. That's why I<br>
> didn't report it any earlier. Very mysterious.<br>
<br>
</div>This is quite interesting. We observed this behavior as well --<br>
list_queues and list_channels hanging. This was also reflected in<br>
consumers/publishers: we could publish messages fine, but trying to read<br>
from a queue (or even delete one) would hang, usually indefinitely.<br>
<br>
We also noted that if we repeatedly attempted to run list_queues the RPC<br>
call would eventually succeed -- maybe once out of 10 or 15 runs. With<br>
the exception of certain queues building up with messages (as I<br>
mentioned above) everything looked fine.<br>
<br>
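The "retry until the RPC call eventually succeeds" behaviour described above can be<br>
reproduced without blocking a terminal. What follows is only a minimal sketch, assuming<br>
Python 3.7 or later; the helper name run_with_retries and its defaults (15 attempts,<br>
10 second timeout) are hypothetical and not part of RabbitMQ or rabbitmqctl.<br>
<br>

```python
import subprocess


def run_with_retries(cmd, attempts=15, timeout_s=10.0):
    """Run cmd repeatedly until one attempt finishes within timeout_s.

    Returns (attempt_number, stdout) for the first attempt that exits with
    status 0, or (None, None) if every attempt hangs or fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # The call hung; subprocess.run kills the child before raising,
            # so no stuck process is left behind between attempts.
            continue
        if result.returncode == 0:
            return attempt, result.stdout
    return None, None
```

<br>
Assuming rabbitmqctl is on the PATH and runnable by the current user, calling<br>
run_with_retries(["rabbitmqctl", "list_queues"]) would report which attempt, if any,<br>
got through, which makes the "once out of 10 or 15 runs" figure easy to record.<br>
<br>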
It started when we switched from 1.7.x to 1.8.x (which we're still<br>
running for the moment). It only seems to happen when nodes are<br>
clustered; I've never seen the problem on a non-clustered instance.<br>
<br>
I'll try to grab some more information when/if it happens again for us.<br>
<br>
I also haven't seen the issue occur in probably about 3 weeks now. It's<br>
very sporadic, although I think I've seen it happen more than once in a<br>
day (and then not again for a long time).<br>
<div class="im"><br>
> I have attached the output of status, list_connections, dmesg, and lsof from<br>
> both rabbitmq nodes in the cluster.<br>
<br>
</div>FWIW, here's the minimal information I can offer now:<br>
<br>
- We have a four-node cluster of two disk nodes and two memory nodes<br>
across two physical servers.<br>
- We're running RabbitMQ 1.8.1 with no additional plugins:<br>
{rabbit,"RabbitMQ","1.8.1"},<br>
{mnesia,"MNESIA CXC 138 12","4.4.13"},<br>
{os_mon,"CPO CXC 138 46","2.2.5"},<br>
{sasl,"SASL CXC 138 11","2.1.9"},<br>
{stdlib,"ERTS CXC 138 10","1.16.5"},<br>
{kernel,"ERTS CXC 138 10","2.13.5"}<br>
<br>
This is Erlang R13B04 on SuSE Linux.<br>
<br>
Hopefully this can shed a *little* more light on the problem. Sorry I<br>
can't offer more details at the moment.<br>
<br>
Regards,<br>
<br>
Noah<br>
<div><div></div><div class="h5"><br>
> ----- Original Message ----<br>
> From: Dave Greggory <<a href="mailto:davegreggory@yahoo.com">davegreggory@yahoo.com</a>><br>
> To: Matthew Sackman <<a href="mailto:matthew@rabbitmq.com">matthew@rabbitmq.com</a>>; <a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a><br>
> Sent: Mon, September 13, 2010 11:48:44 AM<br>
> Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging<br>
><br>
> Wow... ok.<br>
><br>
> I'll grab lsof / dmesg / syslog output next time this happens.<br>
><br>
> Thanks for looking into it. Much appreciated.<br>
><br>
><br>
><br>
> ----- Original Message ----<br>
> From: Matthew Sackman <<a href="mailto:matthew@rabbitmq.com">matthew@rabbitmq.com</a>><br>
> To: <a href="mailto:rabbitmq-discuss@lists.rabbitmq.com">rabbitmq-discuss@lists.rabbitmq.com</a><br>
> Sent: Mon, September 13, 2010 10:53:24 AM<br>
> Subject: Re: [rabbitmq-discuss] RabbitMQ 2.0 hanging<br>
><br>
> Hi Dave,<br>
><br>
> Sorry for the delay in getting back to you.<br>
><br>
> Your node1 log had this in it:<br>
><br>
> =ERROR REPORT==== 8-Sep-2010::09:45:43 ===<br>
> ** Generic server <0.29.0> terminating<br>
> ** Last message in was {'EXIT',<0.30.0>,eio}<br>
> ** When Server state == {state,user_sup,undefined,<0.30.0>,<br>
> {<0.29.0>,user_sup}}<br>
> ** Reason for termination ==<br>
> ** eio<br>
><br>
> This is utterly bizarre - we've never seen it before, and it was<br>
> certainly enough to take down node1, or at least hang it.<br>
><br>
> node2 log has:<br>
><br>
> =ERROR REPORT==== 8-Sep-2010::09:41:38 ===<br>
> ** Generic server delegate_process_0 terminating<br>
> ** Last message in was {'$gen_cast',{thunk,#Fun<delegate.4.123807736>}}<br>
> ** When Server state == no_state<br>
> ** Reason for termination ==<br>
> ** {noproc,{gen_server2,call,<br>
> [{delegate_process_1,'rabbit@ent-jms-qa-1'},<br>
> {thunk,#Fun<delegate.5.131821234>},<br>
> infinity]}}<br>
><br>
> This is basically node2 finding that node1 has gone down. This suggests<br>
> (as does your timeline) that node1 actually failed some time previously<br>
> but that the immediate error was not logged and only at some later point<br>
> did a very generic "eio" come out of it - literally error in some form<br>
> of IO operation.<br>
><br>
> Now the eio comes out of process <0.30.0>, which is started very early<br>
> in the Erlang VM boot process. I can't quite tell<br>
> what the user_sup process is meant to be doing - it's so far buried that<br>
> there's no documentation for it. It's quite possible you've found a bug<br>
> in Erlang itself. Even having googled around for a while, I still can't<br>
> really find out what "user" is for - the best I can find is:<br>
> "user is a server which responds to all the messages defined in the I/O<br>
> interface. The code in user.erl can be used as a model for building<br>
> alternative I/O servers." so that's nice and clear. Anyway, my guess is<br>
> some error came out of said I/O server, took out user and user_sup which<br>
> was then logged. But as to what the fault actually was, I'm afraid I<br>
> have no idea.<br>
><br>
> When this next happens, any chance you could check things like the number of<br>
> open file descriptors, and see whether there are any relevant kernel log<br>
> messages? Sorry I can't be more helpful - it's just not something we've ever<br>
> come across before.<br>
><br>
> Matthew<br>
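<br>
For the open file descriptor check Matthew asks for, one possible approach on Linux is to<br>
read /proc directly. This is a rough sketch under stated assumptions: the function names<br>
are hypothetical, the PID of the Erlang VM has to be found separately, and the script needs<br>
permission to read that process's /proc entries (same user or root).<br>
<br>

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in PROC_ROOT/PID/fd; each entry is one open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def read_fd_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit from PROC_ROOT/PID/limits.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    with open(os.path.join(proc_root, str(pid), "limits")) as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, soft limit, hard limit, units
                soft = line.split()[3]
                return int(soft) if soft.isdigit() else None
    return None
```

<br>
Comparing the two numbers when the hang occurs would show whether the node is anywhere<br>
near its descriptor limit; the proc_root parameter exists only so the functions can be<br>
pointed at a copy of the directory for testing.<br>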
</div></div>
<div><div></div><div class="h5">
</div></div></blockquote></div><br><br clear="all"><br>-- <br><p style="color:rgb(0, 51, 102);font-weight:bold"><span style="border-collapse:separate;font-family:arial;line-height:normal;font-size:small">Cal Leeming</span></p>
<br>
</div>