[rabbitmq-discuss] Clustered nodes failure

Ian Cross ian.cross at gmail.com
Wed Sep 19 19:47:36 BST 2012


Hi Simon,

Thanks for that - we'll upgrade this cluster to 2.8.6 as you suggest and
let you know how we get on.

Ian.

On Wed, Sep 19, 2012 at 1:08 PM, Simon MacMullen <simon at rabbitmq.com> wrote:

> Hi Ian.
>
> We've fixed quite a lot of bugs in mirrored queues since 2.7.1. So I would
> have to suggest an upgrade to 2.8.6 first of all.
>
> Cheers, Simon
>
> On 19/09/12 12:45, Ian wrote:
>
>> Hi all,
>>
>> I wonder if anyone can help diagnose problems we've been having with our
>> 2-node clustered rabbit which sporadically seizes up completely. None of
>> the applications can get through to Rabbit though it is still up and
>> running. CPU and RAM spike up to 100%. The Management UI cannot be
>> reached and we end up having to restart the nodes to get service back.
>> Sometimes it does not come back gracefully, requiring a reboot.
>>
>> Some stats:
>>
>>     * Both nodes are 4 Core 8GB RAM CentOS 6.2 virtual machines, running
>>       on VMware ESXi 4.1 host. We are running RabbitMQ version 2.7.1 on
>>       Erlang R14B04.
>>     * Looking at our metrics right now I see around:
>>           o 1000 queues
>>           o 4000 channels
>>           o 8000 bindings
>>           o 16 exchanges
>>     * Memory usage, erlang processes, file descriptors, socket
>>       descriptors are generally low and healthy
>>
>> Analysing errors in the rabbit logs from a recent failure reveals:
>>
>>     * Before the failure we have a bunch of background errors which may
>>       be the fault of our applications like "no binding X between
>>       exchange Y in vhost '/' and queue Z in vhost '/'"
>>     * As we ramp up to the failure we see
>>           o Two errors like this:
>>                 + “** Generic server <0.16813.1677> terminating ** Last
>>                   message in was {'$gen_cast',
>>                   {run_backing_queue,rabbit_mirror_queue_master,
>>                   #Fun<rabbit_mirror_queue_master.4.85178772>}} ** When
>>                   Server state == {lim,0,undefined,false,[],0} ** Reason
>>                   for termination == ** {function_clause,
>>                   [{rabbit_limiter,handle_cast,
>>                   [{run_backing_queue,rabbit_mirror_queue_master,
>>                   #Fun<rabbit_mirror_queue_master.4.85178772>},
>>                   {lim,0,undefined,false,[],0}]},
>>                   {gen_server2,handle_msg,2},
>>                   {proc_lib,init_p_do_apply,3}]}”
>>           o A handful like this:
>>                 + “connection <0.14270.7735>, channel 38 - error:
>>                   {amqp_error,command_invalid,"second 'channel.open'
>>                   seen",'channel.open'}”
>>           o A couple of these:
>>                 + “connection <0.158.6322>, channel 135 - error:
>>                   {amqp_error,not_found, "no queue
>>                   'InRunning.WebClient.SessionId[l0mpn3egx5n0yj0lbs1hcehj]'
>>                   in vhost '/'", 'basic.get'}”
>>           o And then all these:
>>                 + exception on TCP connection <0.14270.7735> from
>>                   WWW.XXX.YYY.ZZZ:59106 {inet_error,enotconn}
>>                 + exception on TCP connection <0.14577.1677> from
>>                   WWW.XXX.YYY.ZZZ:53163 {inet_error,enotconn}
>>                 + exception on TCP connection <0.1520.5487> from
>>                   WWW.XXX.YYY.ZZZ:53435 {timeout,running}
>>                 + exception on TCP connection <0.158.6322> from
>>                   WWW.XXX.YYY.ZZZ:63187
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.17097.1918> from
>>                   WWW.XXX.YYY.ZZZ:55161
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.18340.7733> from
>>                   WWW.XXX.YYY.ZZZ:52868 {inet_error,enotconn}
>>                 + exception on TCP connection <0.24514.6782> from
>>                   WWW.XXX.YYY.ZZZ:64362 {timeout,blocking}
>>                 + exception on TCP connection <0.24518.6782> from
>>                   WWW.XXX.YYY.ZZZ:61252 {timeout,blocking}
>>                 + exception on TCP connection <0.24524.6782> from
>>                   WWW.XXX.YYY.ZZZ:55845 {timeout,blocking}
>>                 + exception on TCP connection <0.24528.6782> from
>>                   WWW.XXX.YYY.ZZZ:53434 {timeout,blocking}
>>                 + exception on TCP connection <0.24532.6782> from
>>                   WWW.XXX.YYY.ZZZ:54398 {timeout,blocking}
>>                 + exception on TCP connection <0.24536.6782> from
>>                   WWW.XXX.YYY.ZZZ:58878 {timeout,blocking}
>>                 + exception on TCP connection <0.24552.6782> from
>>                   WWW.XXX.YYY.ZZZ:63155 {timeout,blocking}
>>                 + exception on TCP connection <0.2577.2793> from
>>                   WWW.XXX.YYY.ZZZ:52752
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.26105.2580> from
>>                   WWW.XXX.YYY.ZZZ:50364
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.27505.6740> from
>>                   WWW.XXX.YYY.ZZZ:56170
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.27741.2921> from
>>                   WWW.XXX.YYY.ZZZ:54600
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.28602.6323> from
>>                   WWW.XXX.YYY.ZZZ:56863
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.30059.3135> from
>>                   WWW.XXX.YYY.ZZZ:57078 {writer,send_failed,{error,closed}}
>>                 + exception on TCP connection <0.5634.2393> from
>>                   WWW.XXX.YYY.ZZZ:53807
>>                   {writer,send_failed,{error,enotconn}}
>>                 + exception on TCP connection <0.6691.6783> from
>>                   WWW.XXX.YYY.ZZZ:64363 {timeout,blocking}
>>
>> Can anyone help?
>>
>> Thanks,
>>
>> Ian
>>
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> rabbitmq-discuss at lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>
>
> --
> Simon MacMullen
> RabbitMQ, VMware
>
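
When comparing failures before and after an upgrade, it can help to tally the "exception on TCP connection" entries by reason rather than reading them one by one. The following is a minimal sketch only, assuming the log entries look like the excerpt quoted in this thread (the reason term may sit on the same line or wrap onto the next one). The function name tally_exception_reasons, the SAMPLE text and its 10.0.0.1 addresses are hypothetical illustrations, not part of RabbitMQ or of the original logs.

```python
import re
from collections import Counter

# Assumed entry shape (taken from the excerpt in this thread):
#   exception on TCP connection <pid> from host:port {reason}
# The reason is an Erlang tuple with at most one level of nested braces,
# e.g. {timeout,blocking} or {writer,send_failed,{error,enotconn}}.
EXCEPTION_RE = re.compile(
    r"exception on TCP connection <[\d.]+> from \S+ "
    r"(\{(?:[^{}]|\{[^{}]*\})*\})"
)


def tally_exception_reasons(log_text):
    """Count TCP connection exceptions in log_text, grouped by reason term."""
    # Collapse all whitespace so a reason wrapped onto the next line still matches.
    flat = " ".join(log_text.split())
    return Counter(m.group(1) for m in EXCEPTION_RE.finditer(flat))


# Hypothetical sample, shaped like the entries quoted above.
SAMPLE = """
exception on TCP connection <0.14270.7735> from 10.0.0.1:59106 {inet_error,enotconn}
exception on TCP connection <0.1520.5487> from 10.0.0.1:53435
{timeout,running}
exception on TCP connection <0.158.6322> from 10.0.0.1:63187
{writer,send_failed,{error,enotconn}}
exception on TCP connection <0.24514.6782> from 10.0.0.1:64362 {timeout,blocking}
exception on TCP connection <0.24518.6782> from 10.0.0.1:61252 {timeout,blocking}
"""

if __name__ == "__main__":
    # Print the most frequent reasons first.
    for reason, count in tally_exception_reasons(SAMPLE).most_common():
        print(f"{count:4d}  {reason}")
```

To run it against a real log, read the file contents into a string and pass that to tally_exception_reasons in place of SAMPLE. A count dominated by one reason (for example {timeout,blocking}) gives a quick way to compare one incident with another.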