[rabbitmq-discuss] Clustered nodes failure
Ian Cross
ian.cross at gmail.com
Wed Sep 19 19:47:36 BST 2012
Hi Simon,
Thanks for that - we'll upgrade this cluster to 2.8.6 as you suggest and
let you know how we get on.
Ian.
On Wed, Sep 19, 2012 at 1:08 PM, Simon MacMullen <simon at rabbitmq.com> wrote:
> Hi Ian.
>
> We've fixed quite a lot of bugs in mirrored queues since 2.7.1. So I would
> have to suggest an upgrade to 2.8.6 first of all.
>
> Cheers, Simon
>
> On 19/09/12 12:45, Ian wrote:
>
>> Hi all,
>>
>> I wonder if anyone can help diagnose problems we've been having with our
>> 2-node clustered rabbit which sporadically seizes up completely. None of
>> the applications can get through to Rabbit though it is still up and
>> running. CPU and RAM spike up to 100%. The Management UI cannot be
>> reached and we end up having to restart the nodes to get service back.
>> Sometimes it does not come back gracefully, requiring a reboot.
>>
>> Some stats:
>>
>> * Both nodes are 4-core, 8GB RAM CentOS 6.2 virtual machines, running
>> on a VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on
>> Erlang R14B04.
>> * Looking at our metrics right now I see around:
>> o 1000 queues
>> o 4000 channels
>> o 8000 bindings
>> o 16 exchanges
>> * Memory usage, Erlang processes, file descriptors and socket
>> descriptors are generally low and healthy
>>
>> Analysing errors in the rabbit logs from a recent failure reveals:
>>
>> * Before the failure we have a number of background errors, which may
>> be the fault of our applications, such as "no binding X between
>> exchange Y in vhost '/' and queue Z in vhost '/'"
>> * As we ramp up to the failure we see
>> o Two errors like this:
>> + "** Generic server <0.16813.1677> terminating
>> ** Last message in was {'$gen_cast',
>> {run_backing_queue,rabbit_mirror_queue_master,
>> #Fun<rabbit_mirror_queue_master.4.85178772>}}
>> ** When Server state == {lim,0,undefined,false,[],0}
>> ** Reason for termination ==
>> ** {function_clause,
>> [{rabbit_limiter,handle_cast,
>> [{run_backing_queue,rabbit_mirror_queue_master,
>> #Fun<rabbit_mirror_queue_master.4.85178772>},
>> {lim,0,undefined,false,[],0}]},
>> {gen_server2,handle_msg,2},
>> {proc_lib,init_p_do_apply,3}]}"
>> o A handful like this:
>> + "connection <0.14270.7735>, channel 38 - error:
>> {amqp_error,command_invalid,"second 'channel.open'
>> seen",'channel.open'}"
>> o A couple of these:
>> + "connection <0.158.6322>, channel 135 - error:
>> {amqp_error,not_found, "no queue
>> 'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]'
>> in vhost '/'", 'basic.get'}"
>> o And then all these:
>> + exception on TCP connection <0.14270.7735> from
>> WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}
>> + exception on TCP connection <0.14577.1677> from
>> WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn}
>> + exception on TCP connection <0.1520.5487> from
>> WWW.XXX.YYY.ZZZ:53435 {timeout,running}
>> + exception on TCP connection <0.158.6322> from
>> WWW.XXX.YYY.ZZZ:63187
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.17097.1918> from
>> WWW.XXX.YYY.ZZZ:55161
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.18340.7733> from
>> WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn}
>> + exception on TCP connection <0.24514.6782> from
>> WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}
>> + exception on TCP connection <0.24518.6782> from
>> WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}
>> + exception on TCP connection <0.24524.6782> from
>> WWW.XXX.YYY.ZZZ:55845 {timeout,blocking}
>> + exception on TCP connection <0.24528.6782> from
>> WWW.XXX.YYY.ZZZ:53434 {timeout,blocking}
>> + exception on TCP connection <0.24532.6782> from
>> WWW.XXX.YYY.ZZZ:54398 {timeout,blocking}
>> + exception on TCP connection <0.24536.6782> from
>> WWW.XXX.YYY.ZZZ:58878 {timeout,blocking}
>> + exception on TCP connection <0.24552.6782> from
>> WWW.XXX.YYY.ZZZ:63155 {timeout,blocking}
>> + exception on TCP connection <0.2577.2793> from
>> WWW.XXX.YYY.ZZZ:52752
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.26105.2580> from
>> WWW.XXX.YYY.ZZZ:50364
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.27505.6740> from
>> WWW.XXX.YYY.ZZZ:56170
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.27741.2921> from
>> WWW.XXX.YYY.ZZZ:54600
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.28602.6323> from
>> WWW.XXX.YYY.ZZZ:56863
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.30059.3135> from
>> WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}}
>> + exception on TCP connection <0.5634.2393> from
>> WWW.XXX.YYY.ZZZ:53807
>> {writer,send_failed,{error,enotconn}}
>> + exception on TCP connection <0.6691.6783> from
>> WWW.XXX.YYY.ZZZ:64363 {timeout,blocking}
>>
>> Can anyone help?
>>
>> Thanks,
>>
>> Ian
>>
>
>
> --
> Simon MacMullen
> RabbitMQ, VMware
>
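
For anyone triaging a similar log, the "exception on TCP connection" entries quoted above can be grouped by reason to see which failure mode dominates. What follows is only a minimal Python sketch under stated assumptions: the function name tally_tcp_exceptions and the regular expression are hypothetical and based solely on the line shapes quoted in this message, so a real rabbit log that wraps or timestamps entries differently may need a different pattern.

```python
import re
from collections import Counter

# Hypothetical pattern, derived only from the lines quoted in this thread.
# It allows one level of nested braces in the reason term, which covers
# e.g. {writer,send_failed,{error,enotconn}}.
EXCEPTION_RE = re.compile(
    r"exception on TCP connection (<[\d.]+>) from (\S+?):(\d+)\s+"
    r"(\{(?:[^{}]|\{[^{}]*\})*\})"
)


def tally_tcp_exceptions(log_text):
    """Count TCP connection exceptions by their reason term."""
    # Collapse line wraps so a reason on the following line still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in EXCEPTION_RE.finditer(flat):
        # Drop any whitespace left inside the reason by line wrapping.
        reason = re.sub(r"\s+", "", match.group(4))
        counts[reason] += 1
    return counts


# A few entries copied from the message (addresses are already masked).
SAMPLE = """\
exception on TCP connection <0.14270.7735> from
WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}
exception on TCP connection <0.158.6322> from
WWW.XXX.YYY.ZZZ:63187
{writer,send_failed,{error,enotconn}}
exception on TCP connection <0.24514.6782> from
WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}
exception on TCP connection <0.24518.6782> from
WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}
"""

for reason, count in tally_tcp_exceptions(SAMPLE).most_common():
    print(count, reason)
```

Grouping by the whole reason term keeps {timeout,blocking} separate from {timeout,running}, which may help in telling stalled connections apart from sockets that have already been closed.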