[rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster

Jerry Kuch jerryk at vmware.com
Tue Feb 14 20:45:17 GMT 2012

Hi, Vadim:

A client doesn't need to be connected directly to the node on which
a queue and its attendant Erlang process reside.  If your load balancer
sends you to any live node in the cluster you can consume from the queue
of your choice, as long as it's still alive.

Note that you can't straightforwardly redeclare a queue that had been on
a node that's gone down.  The cluster's metadata will still know about it
and prevent you from redeclaring it.  This is intentional, to avoid the
confusion that would result if you succeeded at the redeclare, a new
queue of the same name and properties came into existence on another node,
and then the original, downed node came back up in the cluster...
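
The two points above can be sketched with a toy in-memory model. This is only an illustration of the behaviour described in this message, not RabbitMQ code: `ToyCluster`, `QueueUnavailable`, `RedeclareRefused` and the node and queue names are all hypothetical, and a real broker's error reporting will differ.

```python
# Toy model only: none of these names are RabbitMQ APIs.

class QueueUnavailable(Exception):
    """The queue's home node is down, so the queue cannot be used."""


class RedeclareRefused(Exception):
    """Cluster metadata still knows the queue, so a redeclare is refused."""


class ToyCluster:
    def __init__(self, nodes):
        self.alive = {name: True for name in nodes}  # node name -> is it up?
        self.queue_home = {}  # cluster-wide metadata: queue name -> home node

    def _require_alive(self, node):
        if not self.alive[node]:
            raise ConnectionError(f"node {node} is down")

    def declare(self, via_node, queue):
        """Declare a queue through whichever node the client reached."""
        self._require_alive(via_node)
        home = self.queue_home.get(queue)
        if home is None:
            # New queue: it lives on the node the client happened to reach.
            self.queue_home[queue] = via_node
            return via_node
        if not self.alive[home]:
            # Metadata still records the queue on the downed node.
            raise RedeclareRefused(f"{queue} still belongs to downed node {home}")
        return home  # queue already exists and is reachable

    def consume(self, via_node, queue):
        """Consume through any live node; the queue's home node does the work."""
        self._require_alive(via_node)
        home = self.queue_home[queue]
        if not self.alive[home]:
            raise QueueUnavailable(f"{queue} lives on downed node {home}")
        return f"consuming {queue} on {home} via {via_node}"


cluster = ToyCluster(["a", "b", "c"])
cluster.declare("a", "jobs")            # queue process lives on node "a"
print(cluster.consume("b", "jobs"))     # a client sent to node "b" can still consume
cluster.alive["a"] = False              # the home node goes down
try:
    cluster.declare("b", "jobs")        # redeclare through a surviving node
except RedeclareRefused as exc:
    print("refused:", exc)
```

In this sketch the refusal is what keeps a second "jobs" queue from appearing on node "b" while the original is still recorded on node "a".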

Best regards,

----- Original Message -----
From: "Vadim Chekan" <kot.begemot at gmail.com>
To: "Simon MacMullen" <simon at rabbitmq.com>, ghanna at verticalsearchworks.com
Cc: rabbitmq-discuss at lists.rabbitmq.com
Sent: Tuesday, February 14, 2012 12:40:05 PM
Subject: Re: [rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster

Hi Simon, 

Thanks for looking into the logs. Since we fixed the channel leak in our application, we no longer experience any problems. 

Regarding transient queues in HA: I am just not sure how the system behaves when a non-HA queue is declared in a clustered environment. The documentation describes in great detail what happens to mirrored queues, but I can't find anything about non-HA queues in an HA cluster. The queue will be created on a single server, and the application should be ready to re-declare the queue in case of failover. So far so good. But how does it work with a load balancer? When a request is made against a server which does not have a given queue, will the cluster "know" where the queue is and proxy the request to the proper server? 


On Mon, Feb 13, 2012 at 10:09 AM, Simon MacMullen < simon at rabbitmq.com > wrote: 

On 10/02/12 00:20, Vadim Chekan wrote: 

I think we nailed down the problem. We had a channel leak in our 
application. With ~50 connections we had >90 channels per connection and 
growing. This definitely correlates with high CPU usage. 

What I still do not understand is whether it triggered rabbit into an 
unstable state or it was something else. Maybe increasing latencies in 
message handling triggered cluster members into flipping neighbor 
aliveness status back and forth? Just speculating here: could timeouts 
caused by high load lead to network fragmentation, where every node 
temporarily does not see its neighbors, becomes a master, then sees a 
neighbor, freaks out, etc.? 

That's plausible, but I don't think that's what's happening (there's nothing about network partitioning in the logs). 

I've attached logs from all 3 cluster members. They are polluted with 
load balancer "ping". 

Thanks. I've had a poke at this but nothing is leaping out at me yet. I'll keep at it though. 

One thing that's a bit odd: you seem to be creating HA / transient / autodelete / exclusive queues. So although they're "HA", they will vanish if any of the following happens: 

* The entire cluster goes down (transient) or 
* All consumers for a queue cancel (autodelete) or 
* The connection that created them closes (exclusive) 

Is this intentional? It seems like an odd use of HA. 
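
The three conditions listed above can be written down as a small helper. This is a minimal sketch that encodes only what this message says about HA queues; `vanish_conditions` is a hypothetical name, and its parameters are named after the `durable`, `auto-delete` and `exclusive` fields of AMQP 0-9-1 `queue.declare`.

```python
def vanish_conditions(durable, auto_delete, exclusive):
    """List the events that would make an HA queue with these flags vanish."""
    reasons = []
    if not durable:
        # A transient queue does not survive the whole cluster going down.
        reasons.append("the entire cluster goes down (transient)")
    if auto_delete:
        reasons.append("all consumers for the queue cancel (autodelete)")
    if exclusive:
        reasons.append("the connection that created it closes (exclusive)")
    return reasons


# The combination discussed here: transient, autodelete and exclusive.
for reason in vanish_conditions(durable=False, auto_delete=True, exclusive=True):
    print("queue vanishes if", reason)
```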

Cheers, Simon 

Simon MacMullen 
RabbitMQ, VMware 

