Hi Simon,<div><br></div><div>Thanks for that - we'll upgrade this cluster to 2.8.6 as you suggest and let you know how we get on.</div><div><br></div><div>Ian.<br><br><div class="gmail_quote">On Wed, Sep 19, 2012 at 1:08 PM, Simon MacMullen <span dir="ltr"><<a href="mailto:simon@rabbitmq.com" target="_blank">simon@rabbitmq.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Ian.<br>
<br>
We've fixed quite a lot of bugs in mirrored queues since 2.7.1, so I would suggest upgrading to 2.8.6 first of all.<br>
<br>
Cheers, Simon<br>
<br>
On 19/09/12 12:45, Ian wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi all,<br>
<br>
I wonder if anyone can help diagnose problems we've been having with our<br>
2-node RabbitMQ cluster, which sporadically seizes up completely. None of<br>
the applications can get through to Rabbit, though it is still up and<br>
running. CPU and RAM spike to 100%. The Management UI cannot be<br>
reached and we end up having to restart the nodes to restore service.<br>
Sometimes it does not come back gracefully and requires a reboot.<br>
<br>
Some stats:<br>
<br>
* Both nodes are 4-core, 8GB RAM CentOS 6.2 virtual machines, running<br>
on a VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on<br>
Erlang R14B04.<br>
* Looking at our metrics right now I see around:<br>
o 1000 queues<br>
o 4000 channels<br>
o 8000 bindings<br>
o 16 exchanges<br>
* Memory usage, Erlang processes, file descriptors and socket<br>
descriptors are generally low and healthy<br>
<br>
Analysing errors in the rabbit logs from a recent failure reveals:<br>
<br>
* Before the failure we see a number of background errors, which may<br>
be the fault of our applications, such as "no binding X between<br>
exchange Y in vhost '/' and queue Z in vhost '/'"<br>
* As we ramp up to the failure we see<br>
o Two errors like this:<br>
+ “** Generic server <0.16813.1677> terminating ** Last<br>
message in was {'$gen_cast',<br>
{run_backing_queue,rabbit_mirror_queue_master,<br>
#Fun<rabbit_mirror_queue_master.4.85178772>}} ** When<br>
Server state == {lim,0,undefined,false,[],0} ** Reason<br>
for termination == ** {function_clause,<br>
[{rabbit_limiter,handle_cast,<br>
[{run_backing_queue,rabbit_mirror_queue_master,<br>
#Fun<rabbit_mirror_queue_master.4.85178772>},<br>
{lim,0,undefined,false,[],0}]},<br>
{gen_server2,handle_msg,2},<br>
{proc_lib,init_p_do_apply,3}]}”<br>
o A handful like this:<br>
+ “connection <0.14270.7735>, channel 38 - error:<br>
{amqp_error,command_invalid,"second 'channel.open'<br>
seen",'channel.open'}”<br>
o A couple of these:<br>
+ “connection <0.158.6322>, channel 135 - error:<br>
{amqp_error,not_found, "no queue<br>
'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]'<br>
in vhost '/'", 'basic.get'}”<br>
o And then all these:<br>
+ exception on TCP connection <0.14270.7735> from<br>
WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}<br>
+ exception on TCP connection <0.14577.1677> from<br>
WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn}<br>
+ exception on TCP connection <0.1520.5487> from<br>
WWW.XXX.YYY.ZZZ:53435 {timeout,running}<br>
+ exception on TCP connection <0.158.6322> from<br>
WWW.XXX.YYY.ZZZ:63187<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.17097.1918> from<br>
WWW.XXX.YYY.ZZZ:55161<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.18340.7733> from<br>
WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn}<br>
+ exception on TCP connection <0.24514.6782> from<br>
WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}<br>
+ exception on TCP connection <0.24518.6782> from<br>
WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}<br>
+ exception on TCP connection <0.24524.6782> from<br>
WWW.XXX.YYY.ZZZ:55845 {timeout,blocking}<br>
+ exception on TCP connection <0.24528.6782> from<br>
WWW.XXX.YYY.ZZZ:53434 {timeout,blocking}<br>
+ exception on TCP connection <0.24532.6782> from<br>
WWW.XXX.YYY.ZZZ:54398 {timeout,blocking}<br>
+ exception on TCP connection <0.24536.6782> from<br>
WWW.XXX.YYY.ZZZ:58878 {timeout,blocking}<br>
+ exception on TCP connection <0.24552.6782> from<br>
WWW.XXX.YYY.ZZZ:63155 {timeout,blocking}<br>
+ exception on TCP connection <0.2577.2793> from<br>
WWW.XXX.YYY.ZZZ:52752<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.26105.2580> from<br>
WWW.XXX.YYY.ZZZ:50364<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.27505.6740> from<br>
WWW.XXX.YYY.ZZZ:56170<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.27741.2921> from<br>
WWW.XXX.YYY.ZZZ:54600<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.28602.6323> from<br>
WWW.XXX.YYY.ZZZ:56863<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.30059.3135> from<br>
WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}}<br>
+ exception on TCP connection <0.5634.2393> from<br>
WWW.XXX.YYY.ZZZ:53807<br>
{writer,send_failed,{error,enotconn}}<br>
+ exception on TCP connection <0.6691.6783> from<br>
WWW.XXX.YYY.ZZZ:64363 {timeout,blocking}<br>
<br>
Can anyone help?<br>
<br>
Thanks,<br>
<br>
Ian<br>
<br>
<br>
<br>
_______________________________________________<br>
rabbitmq-discuss mailing list<br>
<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com" target="_blank">rabbitmq-discuss@lists.rabbitmq.com</a><br>
<a href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><span class="HOEnZb"><font color="#888888"><br>
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
<br>
-- <br>
Simon MacMullen<br>
RabbitMQ, VMware<br>
</font></span></blockquote></div><br></div>