[rabbitmq-discuss] Clustered nodes failure

Ian ian.cross at gmail.com
Wed Sep 19 12:45:30 BST 2012


Hi all,

I wonder if anyone can help diagnose problems we've been having with our 
2-node RabbitMQ cluster, which sporadically seizes up completely. None of 
the applications can get through to Rabbit even though it is still up and 
running. CPU and RAM spike to 100%. The Management UI cannot be reached, 
and we end up having to restart the nodes to restore service. Sometimes it 
does not come back gracefully and requires a reboot.

Some stats: 

   - Both nodes are 4-core, 8 GB RAM CentOS 6.2 virtual machines running on 
   a VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on Erlang 
   R14B04.
   - Looking at our metrics right now I see around:
      - 1000 queues
      - 4000 channels
      - 8000 bindings
      - 16 exchanges
   - Memory usage, Erlang processes, file descriptors and socket descriptors 
   are generally low and healthy

Analysing errors in the rabbit logs from a recent failure reveals:

   - Before the failure we have a bunch of background errors, which may be 
   the fault of our applications, such as "no binding X between exchange Y in 
   vhost '/' and queue Z in vhost '/'"
   - As we ramp up to the failure we see
      - Two errors like this:
         - “** Generic server <0.16813.1677> terminating
           ** Last message in was {'$gen_cast',
                  {run_backing_queue,rabbit_mirror_queue_master,
                      #Fun<rabbit_mirror_queue_master.4.85178772>}}
           ** When Server state == {lim,0,undefined,false,[],0}
           ** Reason for termination ==
           ** {function_clause,
                  [{rabbit_limiter,handle_cast,
                       [{run_backing_queue,rabbit_mirror_queue_master,
                            #Fun<rabbit_mirror_queue_master.4.85178772>},
                        {lim,0,undefined,false,[],0}]},
                   {gen_server2,handle_msg,2},
                   {proc_lib,init_p_do_apply,3}]}”
      - A handful like this:
         - “connection <0.14270.7735>, channel 38 - error:
           {amqp_error,command_invalid,"second 'channel.open' seen",'channel.open'}”
      - A couple of these:
         - “connection <0.158.6322>, channel 135 - error:
           {amqp_error,not_found,
               "no queue 'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]' in vhost '/'",
               'basic.get'}”
      - And then all these:
         - exception on TCP connection <0.14270.7735> from 
         WWW.XXX.YYY.ZZZ:59106  {inet_error,enotconn}  
         - exception on TCP connection <0.14577.1677> from 
         WWW.XXX.YYY.ZZZ:53163  {inet_error,enotconn}  
         - exception on TCP connection <0.1520.5487> from 
         WWW.XXX.YYY.ZZZ:53435  {timeout,running}  
         - exception on TCP connection <0.158.6322> from 
         WWW.XXX.YYY.ZZZ:63187  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.17097.1918> from 
         WWW.XXX.YYY.ZZZ:55161  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.18340.7733> from 
         WWW.XXX.YYY.ZZZ:52868  {inet_error,enotconn}  
         - exception on TCP connection <0.24514.6782> from 
         WWW.XXX.YYY.ZZZ:64362  {timeout,blocking}  
         - exception on TCP connection <0.24518.6782> from 
         WWW.XXX.YYY.ZZZ:61252  {timeout,blocking}  
         - exception on TCP connection <0.24524.6782> from 
         WWW.XXX.YYY.ZZZ:55845  {timeout,blocking}  
         - exception on TCP connection <0.24528.6782> from 
         WWW.XXX.YYY.ZZZ:53434  {timeout,blocking}  
         - exception on TCP connection <0.24532.6782> from 
         WWW.XXX.YYY.ZZZ:54398  {timeout,blocking}  
         - exception on TCP connection <0.24536.6782> from 
         WWW.XXX.YYY.ZZZ:58878  {timeout,blocking}  
         - exception on TCP connection <0.24552.6782> from 
         WWW.XXX.YYY.ZZZ:63155  {timeout,blocking}  
         - exception on TCP connection <0.2577.2793> from 
         WWW.XXX.YYY.ZZZ:52752  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.26105.2580> from 
         WWW.XXX.YYY.ZZZ:50364  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.27505.6740> from 
         WWW.XXX.YYY.ZZZ:56170  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.27741.2921> from 
         WWW.XXX.YYY.ZZZ:54600  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.28602.6323> from 
         WWW.XXX.YYY.ZZZ:56863  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.30059.3135> from 
         WWW.XXX.YYY.ZZZ:57078  {writer,send_failed,{error,closed}}  
         - exception on TCP connection <0.5634.2393> from 
         WWW.XXX.YYY.ZZZ:53807  {writer,send_failed,{error,enotconn}}  
         - exception on TCP connection <0.6691.6783> from 
         WWW.XXX.YYY.ZZZ:64363  {timeout,blocking}  
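
For anyone wanting to summarise a log excerpt like the one above, here is a minimal Python sketch that tallies the TCP connection exceptions by reason. The regular expression and the name tally_reasons are assumptions based only on the lines quoted in this message; they are not part of RabbitMQ or its tooling, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Assumed entry shape, taken only from the excerpt in this message:
#   exception on TCP connection <pid> from host:port  {reason}
# The reason terms quoted here contain no whitespace, so \S+ is enough.
ENTRY_RE = re.compile(
    r"exception on TCP connection (<[\d.]+>) from (\S+):(\d+) (\{\S+\})"
)


def tally_reasons(log_text):
    """Count TCP connection exceptions by their reason term."""
    # Entries in the mail are wrapped across lines, so collapse all
    # runs of whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return Counter(reason for _pid, _host, _port, reason in ENTRY_RE.findall(flat))


if __name__ == "__main__":
    sample = """
    - exception on TCP connection <0.14270.7735> from
    WWW.XXX.YYY.ZZZ:59106  {inet_error,enotconn}
    - exception on TCP connection <0.24514.6782> from
    WWW.XXX.YYY.ZZZ:64362  {timeout,blocking}
    - exception on TCP connection <0.24518.6782> from
    WWW.XXX.YYY.ZZZ:61252  {timeout,blocking}
    """
    for reason, count in tally_reasons(sample).most_common():
        print(count, reason)
```

Grouping by reason makes it easier to see whether the failure is dominated by timeouts on the broker side or by clients that have already disconnected.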
      
Can anyone help?

Thanks,

Ian