[rabbitmq-discuss] Clustered nodes failure
Ian
ian.cross at gmail.com
Wed Sep 19 12:45:30 BST 2012
Hi all,
I wonder if anyone can help diagnose problems we've been having with our
2-node RabbitMQ cluster, which sporadically seizes up completely. None of
the applications can get through to Rabbit, even though it is still up and
running. CPU and RAM spike to 100%. The Management UI cannot be reached,
and we end up having to restart the nodes to get service back. Sometimes it
does not come back gracefully and we have to reboot.
Some stats:
- Both nodes are 4-core, 8 GB RAM CentOS 6.2 virtual machines running on
  a VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on Erlang
  R14B04.
- Looking at our metrics right now I see around:
  - 1000 queues
  - 4000 channels
  - 8000 bindings
  - 16 exchanges
- Memory usage, Erlang processes, file descriptors and socket descriptors
  are generally low and healthy
Analysing errors in the rabbit logs from a recent failure reveals:
- Before the failure we have a bunch of background errors which may be
  the fault of our applications like "no binding X between exchange Y in
  vhost '/' and queue Z in vhost '/'"
- As we ramp up to the failure we see
  - Two errors like this:
    - “** Generic server <0.16813.1677> terminating
      ** Last message in was {'$gen_cast',
             {run_backing_queue,rabbit_mirror_queue_master,
              #Fun<rabbit_mirror_queue_master.4.85178772>}}
      ** When Server state == {lim,0,undefined,false,[],0}
      ** Reason for termination ==
      ** {function_clause,
             [{rabbit_limiter,handle_cast,
               [{run_backing_queue,rabbit_mirror_queue_master,
                 #Fun<rabbit_mirror_queue_master.4.85178772>},
                {lim,0,undefined,false,[],0}]},
              {gen_server2,handle_msg,2},
              {proc_lib,init_p_do_apply,3}]}”
  - A handful like this:
    - “connection <0.14270.7735>, channel 38 - error:
      {amqp_error,command_invalid,"second 'channel.open' seen",'channel.open'}”
  - A couple of these:
    - “connection <0.158.6322>, channel 135 - error:
      {amqp_error,not_found,
       "no queue 'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]' in vhost '/'",
       'basic.get'}”
  - And then all these:
    - exception on TCP connection <0.14270.7735> from WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}
    - exception on TCP connection <0.14577.1677> from WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn}
    - exception on TCP connection <0.1520.5487> from WWW.XXX.YYY.ZZZ:53435 {timeout,running}
    - exception on TCP connection <0.158.6322> from WWW.XXX.YYY.ZZZ:63187 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.17097.1918> from WWW.XXX.YYY.ZZZ:55161 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.18340.7733> from WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn}
    - exception on TCP connection <0.24514.6782> from WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}
    - exception on TCP connection <0.24518.6782> from WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}
    - exception on TCP connection <0.24524.6782> from WWW.XXX.YYY.ZZZ:55845 {timeout,blocking}
    - exception on TCP connection <0.24528.6782> from WWW.XXX.YYY.ZZZ:53434 {timeout,blocking}
    - exception on TCP connection <0.24532.6782> from WWW.XXX.YYY.ZZZ:54398 {timeout,blocking}
    - exception on TCP connection <0.24536.6782> from WWW.XXX.YYY.ZZZ:58878 {timeout,blocking}
    - exception on TCP connection <0.24552.6782> from WWW.XXX.YYY.ZZZ:63155 {timeout,blocking}
    - exception on TCP connection <0.2577.2793> from WWW.XXX.YYY.ZZZ:52752 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.26105.2580> from WWW.XXX.YYY.ZZZ:50364 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.27505.6740> from WWW.XXX.YYY.ZZZ:56170 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.27741.2921> from WWW.XXX.YYY.ZZZ:54600 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.28602.6323> from WWW.XXX.YYY.ZZZ:56863 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.30059.3135> from WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}}
    - exception on TCP connection <0.5634.2393> from WWW.XXX.YYY.ZZZ:53807 {writer,send_failed,{error,enotconn}}
    - exception on TCP connection <0.6691.6783> from WWW.XXX.YYY.ZZZ:64363 {timeout,blocking}
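
To make the spread of connection exceptions easier to see, they can be tallied by reason. The following is only a rough sketch: the regular expression is a guess based on the excerpts above rather than any official RabbitMQ log grammar, and `tally_exceptions` is a hypothetical helper name, not part of RabbitMQ or its tooling.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpts above (an assumption, not a
# documented format): pid in angle brackets, peer host:port, then an
# Erlang term with no embedded spaces as the reason.
EXCEPTION_RE = re.compile(
    r"exception on TCP connection <(?P<pid>[\d.]+)> from\s+"
    r"(?P<peer>\S+):(?P<port>\d+)\s+(?P<reason>\{\S+\})"
)


def tally_exceptions(log_text):
    """Count 'exception on TCP connection' entries by reason term."""
    # Entries may be wrapped across lines, so collapse all whitespace
    # runs into single spaces before matching.
    flattened = " ".join(log_text.split())
    return Counter(m.group("reason") for m in EXCEPTION_RE.finditer(flattened))


if __name__ == "__main__":
    import sys

    # Reads log text from standard input and prints reasons, most common first.
    for reason, count in tally_exceptions(sys.stdin.read()).most_common():
        print(count, reason)
```

Piping the relevant part of the rabbit log into the script prints one line per distinct reason, which shows at a glance whether the failure is dominated by `{timeout,blocking}` or by `enotconn` write failures.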
Can anyone help?
Thanks,
Ian