[rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster

Wed Feb 15 22:09:30 GMT 2012

Hi, Vadim.

I apologize if I'm misunderstanding you, but I'm not entirely sure why you'd
want a *transient* queue to be HA, unless you've come to believe that
reading from a queue while connected to a particular cluster node, N, requires
a mirror of that queue on node N.  It does not:  Rabbit will fetch messages
from the queue internally and deliver them to your consumer on whichever node
it's connected to.

So if I'm reading you correctly and your scenario is that you want to declare 
transient queues and access them from any cluster node, then you don't have
any extra work to do.  Just declare that transient queue, period, and access
it freely.  You don't have to worry about where it is, and you'd use HA only
if you wanted the queue to remain available if the node it lived on went down.
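
To make that concrete, here is a minimal sketch in Python.  It assumes the
pika client library (nothing in this thread says which client you use), and
the host name and queue name are hypothetical.  It declares a plain transient
queue with no HA arguments; the channel can belong to a connection to any
node in the cluster.

```python
def declare_transient_queue(channel, name):
    """Declare a plain, non-mirrored, non-durable queue.

    `channel` is expected to behave like a pika channel; nothing here
    depends on which cluster node the underlying connection reached.
    """
    return channel.queue_declare(
        queue=name,
        durable=False,      # transient: not kept across a broker restart
        exclusive=False,    # usable from other connections
        auto_delete=False,  # stays when the last consumer cancels
    )


def main():
    # Assumption: pika is installed and "rabbit-lb.example" is a
    # hypothetical load balancer in front of the cluster.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbit-lb.example")
    )
    try:
        channel = connection.channel()
        declare_transient_queue(channel, "work.items")
    finally:
        connection.close()
```

The declare call is kept separate from the connection setup so the same
helper works no matter which node the load balancer picks.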

Take a look at:  http://www.rabbitmq.com/clustering.html

It summarizes what data lives where in a cluster.  In particular, it states that 
"All data/state required for the operation of a RabbitMQ broker is replicated 
across all nodes, for reliability and scaling, with full ACID properties. 
An exception to this are message queues, which by default reside on the node 
that created them, though *they are visible and reachable from all nodes*."

Make sense?

Best regards,
Jerry

----- Original Message -----
From: "Vadim Chekan" <kot.begemot at gmail.com>
To: "Jerry Kuch" <jerryk at vmware.com>
Cc: rabbitmq-discuss at lists.rabbitmq.com
Sent: Wednesday, February 15, 2012 12:11:45 PM
Subject: Re: [rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster

Hi Jerry, 

So is there a better way to declare transient queues than declaring them as HA? 
The only alternative I can see is adding a random string to the queue name. Which way is preferred? 

Vadim. 

On Tue, Feb 14, 2012 at 12:45 PM, Jerry Kuch < jerryk at vmware.com > wrote: 

Hi, Vadim: 

A client doesn't need to be connected directly to the node on which 
a queue and its attendant Erlang process reside. If your load balancer 
sends you to any live node in the cluster you can consume from the queue 
of your choice, as long as it's still alive. 

Note that you can't straightforwardly redeclare a queue that had been on 
a node that's gone down. The cluster's metadata will still know about it 
and prevent you from redeclaring it. This is intentional, to avoid the 
confusion that would result if you succeeded at the redeclare, a new 
queue of the same name and properties came into existence on another node, 
and then the original, downed node came back up in the cluster... 

Best regards, 
Jerry 

----- Original Message ----- 
From: "Vadim Chekan" < kot.begemot at gmail.com > 
To: "Simon MacMullen" < simon at rabbitmq.com >, ghanna at verticalsearchworks.com 
Cc: rabbitmq-discuss at lists.rabbitmq.com 
Sent: Tuesday, February 14, 2012 12:40:05 PM 
Subject: Re: [rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster 

Hi Simon, 

Thanks for looking into the logs. Since we fixed the channel leak in our application, we have not experienced any further problems. 

Regarding transient queues in HA: I am just not sure how the system behaves when a non-HA queue is declared in a clustered environment. The documentation describes in great detail what happens to mirrored queues, but I can't find anything about non-HA queues in an HA cluster. The queue will be created on a single server, and the application should be ready to re-declare it in case of failover. So far so good. But how does this work with a load balancer? When a request is made against a server that does not host a given queue, will the cluster "know" where the queue is and proxy the request to the proper server? 

Thanks, 
Vadim. 

On Mon, Feb 13, 2012 at 10:09 AM, Simon MacMullen < simon at rabbitmq.com > wrote: 

On 10/02/12 00:20, Vadim Chekan wrote: 

I think we nailed down a problem. We had a channel leak in our 
application. With ~50 connections we had >90 channels per connection and 
growing. This definitely correlates to high CPU usage. 
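
A leak like this usually means channels are opened per operation and never
closed. One way to guard against it is sketched below in Python, assuming a
client whose connection has `channel()` and whose channel has `is_open` and
`close()` (pika's API has this shape). The thread does not say which client
the application uses, so treat the names as illustrative.

```python
from contextlib import contextmanager


@contextmanager
def open_channel(connection):
    """Yield a channel and close it on exit, even if the body raises."""
    channel = connection.channel()
    try:
        yield channel
    finally:
        # Assumption: the channel exposes is_open and close().
        if channel.is_open:
            channel.close()
```

Used as `with open_channel(connection) as ch: ...`, each unit of work gives
its channel back, so the per-connection channel count stays flat.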

What I still do not understand is whether it pushed rabbit into an unstable 
state or whether it was something else. Maybe increasing latencies in message 
handling caused cluster members to flip their neighbors' aliveness 
status back and forth? Just speculating here: could timeouts caused by 
high load lead to network fragmentation, where every node temporarily does 
not see its neighbors, becomes a master, then sees a neighbor, freaks out, etc.? 

That's plausible, but I don't think that's what's happening (there's nothing about network partitioning in the logs). 

I've attached logs from all 3 cluster members. They are polluted with 
load balancer "ping". 

Thanks. I've had a poke at this but nothing is leaping out at me yet. I'll keep at it though. 

One thing that's a bit odd: you seem to be creating HA / transient / autodelete / exclusive queues. So although they're "HA", they will vanish if any of the following happens: 

* The entire cluster goes down (transient) or 
* All consumers for a queue cancel (autodelete) or 
* The connection that created them closes (exclusive) 

Is this intentional? It seems like an odd use of HA. 
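
The three rules above can be written down as a small Python sketch. The
function name and the event strings are made up for illustration; it only
restates the list for a mirrored queue and does not talk to a broker.

```python
def vanishing_events(durable, auto_delete, exclusive):
    """List the events after which a mirrored queue with these flags disappears.

    Encodes only the three rules listed above. Mirroring (HA) is not a
    parameter because none of the three rules is prevented by it.
    """
    events = []
    if not durable:
        events.append("entire cluster goes down")
    if auto_delete:
        events.append("all consumers cancel")
    if exclusive:
        events.append("declaring connection closes")
    return events


# The combination described above: HA, transient, autodelete, exclusive.
print(vanishing_events(durable=False, auto_delete=True, exclusive=True))
```

With all three flags set that way, every one of the events applies, which is
why mirroring buys little for such a queue.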

Cheers, Simon 

-- 
Simon MacMullen 
RabbitMQ, VMware 
