Hi all,<div><br></div><div>I wonder if anyone can help diagnose problems we've been having with our 2-node RabbitMQ cluster, which sporadically seizes up completely. None of the applications can get through to Rabbit even though it is still up and running. CPU and RAM spike to 100%. The Management UI cannot be reached, and we end up having to restart the nodes to get service back. Sometimes it does not come back gracefully and we have to reboot.</div><div><br></div><div>Some stats: </div><div><ul><li>Both nodes are 4-core, 8 GB RAM CentOS 6.2 virtual machines running on a VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on Erlang R14B04.</li><li>Looking at our metrics right now I see around:</li><ul><li>1000 queues</li><li>4000 channels</li><li>8000 bindings</li><li>16 exchanges</li></ul><li>Memory usage, Erlang processes, file descriptors and socket descriptors are generally low and healthy</li></ul><div><span style="line-height: 17px; ">Analysing errors in the RabbitMQ logs from a recent failure reveals:</span><br></div></div><div><ul><li>Before the failure we have a bunch of background errors, which may be the fault of our applications, like "no binding X between exchange Y in vhost '/' and queue Z in vhost '/'"</li><li>As we ramp up to the failure we see:</li><ul><li>Two errors like this:</li><ul><li><span lang="EN-US" style="font-size:11.0pt;font-family:
Calibri,sans-serif">“**
Generic server <0.16813.1677> terminating
** Last message in was {'$gen_cast',
{run_backing_queue,rabbit_mirror_queue_master,
#Fun<rabbit_mirror_queue_master.4.85178772>}} ** When Server state ==
{lim,0,undefined,false,[],0} ** Reason
for termination == ** {function_clause, [{rabbit_limiter,handle_cast,
[{run_backing_queue,rabbit_mirror_queue_master, #Fun<rabbit_mirror_queue_master.4.85178772>},
{lim,0,undefined,false,[],0}]},
{gen_server2,handle_msg,2},
{proc_lib,init_p_do_apply,3}]} “</span></li></ul><li><font face="Calibri, sans-serif"><span style="font-size: 15px;">A handful like this:</span></font></li><ul><li><font face="Calibri, sans-serif"><span style="font-size: 15px;"><span lang="EN-US" style="font-size: 11pt; ">“connection
<0.14270.7735>, channel 38 - error:
{amqp_error,command_invalid,"second 'channel.open'
seen",'channel.open'} “</span><br></span></font></li></ul><li><font face="Calibri, sans-serif"><span style="font-size: 15px;">A couple of these:</span></font></li><ul><li><font face="Calibri, sans-serif"><span style="font-size: 15px;"><span lang="EN-US" style="font-size: 11pt; ">“connection
<0.158.6322>, channel 135 - error:
{amqp_error,not_found,
"no queue 'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]'
in vhost '/'",
'basic.get'} “</span><br></span></font></li></ul><li><font face="Calibri, sans-serif"><span style="font-size: 15px;">And then all these:</span></font></li><ul><li>exception on TCP connection <0.14270.7735> from WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn} </li><li>exception on TCP connection <0.14577.1677> from WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn} </li><li>exception on TCP connection <0.1520.5487> from WWW.XXX.YYY.ZZZ:53435 {timeout,running} </li><li>exception on TCP connection <0.158.6322> from WWW.XXX.YYY.ZZZ:63187 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.17097.1918> from WWW.XXX.YYY.ZZZ:55161 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.18340.7733> from WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn} </li><li>exception on TCP connection <0.24514.6782> from WWW.XXX.YYY.ZZZ:64362 {timeout,blocking} </li><li>exception on TCP connection <0.24518.6782> from WWW.XXX.YYY.ZZZ:61252 {timeout,blocking} </li><li>exception on TCP connection <0.24524.6782> from WWW.XXX.YYY.ZZZ:55845 {timeout,blocking} </li><li>exception on TCP connection <0.24528.6782> from WWW.XXX.YYY.ZZZ:53434 {timeout,blocking} </li><li>exception on TCP connection <0.24532.6782> from WWW.XXX.YYY.ZZZ:54398 {timeout,blocking} </li><li>exception on TCP connection <0.24536.6782> from WWW.XXX.YYY.ZZZ:58878 {timeout,blocking} </li><li>exception on TCP connection <0.24552.6782> from WWW.XXX.YYY.ZZZ:63155 {timeout,blocking} </li><li>exception on TCP connection <0.2577.2793> from WWW.XXX.YYY.ZZZ:52752 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.26105.2580> from WWW.XXX.YYY.ZZZ:50364 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.27505.6740> from WWW.XXX.YYY.ZZZ:56170 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.27741.2921> from WWW.XXX.YYY.ZZZ:54600 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.28602.6323> from 
WWW.XXX.YYY.ZZZ:56863 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.30059.3135> from WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}} </li><li>exception on TCP connection <0.5634.2393> from WWW.XXX.YYY.ZZZ:53807 {writer,send_failed,{error,enotconn}} </li><li>exception on TCP connection <0.6691.6783> from WWW.XXX.YYY.ZZZ:64363 {timeout,blocking} </li></ul></ul></ul><div>Can anyone help?</div></div><div><br></div><div>Thanks,</div><div><br></div><div>Ian</div>
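<div><br></div><div>In case it helps when comparing one failure against another, here is a rough sketch of how the "exception on TCP connection" entries could be tallied by reason. It assumes the log lines look exactly like the excerpts above. The names <code>tally_tcp_exceptions</code> and <code>LINE_RE</code> are made up for illustration and are not part of RabbitMQ.</div>

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this post:
#   exception on TCP connection <0.158.6322> from HOST:PORT {writer,send_failed,{error,enotconn}}
LINE_RE = re.compile(
    r"exception on TCP connection <[\d.]+> from \S+:\d+\s+(\{.*\})\s*$"
)


def tally_tcp_exceptions(lines):
    """Count TCP connection exceptions grouped by their Erlang reason term."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# A few of the lines quoted above, used as sample input.
sample = [
    "exception on TCP connection <0.14270.7735> from WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}",
    "exception on TCP connection <0.158.6322> from WWW.XXX.YYY.ZZZ:63187 {writer,send_failed,{error,enotconn}}",
    "exception on TCP connection <0.2577.2793> from WWW.XXX.YYY.ZZZ:52752 {writer,send_failed,{error,enotconn}}",
    "exception on TCP connection <0.24514.6782> from WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}",
    "some unrelated log line",
]

for reason, count in tally_tcp_exceptions(sample).most_common():
    print(count, reason)
```

<div>Running the same function over a whole log file (for example by passing an open file object as <code>lines</code>) should show whether the <code>{timeout,blocking}</code> entries or the <code>enotconn</code> entries dominate in the run-up to each failure.</div>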