[rabbitmq-discuss] Clustered nodes failure
Simon MacMullen
simon at rabbitmq.com
Wed Sep 19 13:08:31 BST 2012
Hi Ian.
We've fixed quite a lot of bugs in mirrored queues since 2.7.1. So I
would have to suggest an upgrade to 2.8.6 first of all.
Cheers, Simon
On 19/09/12 12:45, Ian wrote:
> Hi all,
>
> I wonder if anyone can help diagnose problems we've been having with our
> 2-node clustered rabbit which sporadically seizes up completely. None of
> the applications can get through to Rabbit though it is still up and
> running. CPU and RAM spike up to 100%. The Management UI cannot be
> reached and we end up having to restart the nodes to get service back.
> Sometimes it does not come back gracefully and requires a reboot.
>
> Some stats:
>
>   * Both nodes are 4 Core 8GB RAM CentOS 6.2 virtual machines, running
>     on VMWare ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on
>     Erlang R14B04.
>   * Looking at our metrics right now I see around:
>       o 1000 queues
>       o 4000 channels
>       o 8000 bindings
>       o 16 exchanges
>   * Memory usage, erlang processes, file descriptors, socket
>     descriptors are generally low and healthy
>
> Analysing errors in the rabbit logs from a recent failure reveals:
>
>   * Before the failure we have a bunch of background errors which may
>     be the fault of our applications like "no binding X between
>     exchange Y in vhost '/' and queue Z in vhost '/'"
>   * As we ramp up to the failure we see
>       o Two errors like this:
>           + “** Generic server <0.16813.1677> terminating ** Last
>             message in was {'$gen_cast',
>             {run_backing_queue,rabbit_mirror_queue_master,
>             #Fun<rabbit_mirror_queue_master.4.85178772>}} ** When
>             Server state == {lim,0,undefined,false,[],0} ** Reason
>             for termination == ** {function_clause,
>             [{rabbit_limiter,handle_cast,
>             [{run_backing_queue,rabbit_mirror_queue_master,
>             #Fun<rabbit_mirror_queue_master.4.85178772>},
>             {lim,0,undefined,false,[],0}]},
>             {gen_server2,handle_msg,2},
>             {proc_lib,init_p_do_apply,3}]}”
>       o A handful like this:
>           + “connection <0.14270.7735>, channel 38 - error:
>             {amqp_error,command_invalid,"second 'channel.open'
>             seen",'channel.open'}”
>       o A couple of these:
>           + “connection <0.158.6322>, channel 135 - error:
>             {amqp_error,not_found, "no queue
>             'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]'
>             in vhost '/'", 'basic.get'}”
>       o And then all these:
>           + exception on TCP connection <0.14270.7735> from
>             WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}
>           + exception on TCP connection <0.14577.1677> from
>             WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn}
>           + exception on TCP connection <0.1520.5487> from
>             WWW.XXX.YYY.ZZZ:53435 {timeout,running}
>           + exception on TCP connection <0.158.6322> from
>             WWW.XXX.YYY.ZZZ:63187
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.17097.1918> from
>             WWW.XXX.YYY.ZZZ:55161
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.18340.7733> from
>             WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn}
>           + exception on TCP connection <0.24514.6782> from
>             WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}
>           + exception on TCP connection <0.24518.6782> from
>             WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}
>           + exception on TCP connection <0.24524.6782> from
>             WWW.XXX.YYY.ZZZ:55845 {timeout,blocking}
>           + exception on TCP connection <0.24528.6782> from
>             WWW.XXX.YYY.ZZZ:53434 {timeout,blocking}
>           + exception on TCP connection <0.24532.6782> from
>             WWW.XXX.YYY.ZZZ:54398 {timeout,blocking}
>           + exception on TCP connection <0.24536.6782> from
>             WWW.XXX.YYY.ZZZ:58878 {timeout,blocking}
>           + exception on TCP connection <0.24552.6782> from
>             WWW.XXX.YYY.ZZZ:63155 {timeout,blocking}
>           + exception on TCP connection <0.2577.2793> from
>             WWW.XXX.YYY.ZZZ:52752
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.26105.2580> from
>             WWW.XXX.YYY.ZZZ:50364
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.27505.6740> from
>             WWW.XXX.YYY.ZZZ:56170
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.27741.2921> from
>             WWW.XXX.YYY.ZZZ:54600
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.28602.6323> from
>             WWW.XXX.YYY.ZZZ:56863
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.30059.3135> from
>             WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}}
>           + exception on TCP connection <0.5634.2393> from
>             WWW.XXX.YYY.ZZZ:53807
>             {writer,send_failed,{error,enotconn}}
>           + exception on TCP connection <0.6691.6783> from
>             WWW.XXX.YYY.ZZZ:64363 {timeout,blocking}
>
> Can anyone help?
>
> Thanks,
>
> Ian
--
Simon MacMullen
RabbitMQ, VMware
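
For anyone triaging a similar log, a minimal Python sketch for tallying the
"exception on TCP connection" entries by reason might look like the
following. It assumes each entry sits on a single line in the shape quoted
in this thread; the regular expression, the function name tally_exceptions
and the sample lines are illustrative and not part of RabbitMQ.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpts quoted in this thread:
#   exception on TCP connection <pid> from host:port {reason}
LINE_RE = re.compile(
    r"exception on TCP connection (<[\d.]+>) from (\S+):(\d+)\s+(\{.*\})"
)


def tally_exceptions(lines):
    """Count connection exceptions, keyed by the Erlang reason term."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group(4)] += 1
    return counts


# A few sample lines copied from the quoted log, joined onto single lines.
sample = [
    "exception on TCP connection <0.14270.7735> from WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}",
    "exception on TCP connection <0.1520.5487> from WWW.XXX.YYY.ZZZ:53435 {timeout,running}",
    "exception on TCP connection <0.24514.6782> from WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}",
    "exception on TCP connection <0.24518.6782> from WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}",
    "exception on TCP connection <0.158.6322> from WWW.XXX.YYY.ZZZ:63187 {writer,send_failed,{error,enotconn}}",
]

for reason, count in tally_exceptions(sample).most_common():
    print(count, reason)
```

Grouping by reason makes it easier to see whether the failure is dominated
by {timeout,blocking} entries or by send failures, which may help when
comparing behaviour before and after an upgrade.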