[rabbitmq-discuss] Active/Active: shutdown of one service brings down the cluster

Mon Feb 13 18:09:44 GMT 2012

On 10/02/12 00:20, Vadim Chekan wrote:
> I think we nailed down a problem. We had a channel leak in our
> application. With ~50 connections we had >90 channels per connection and
> growing. This definitely correlates to high CPU usage.
>
> What I still do not understand is whether it triggered rabbit into an
> unstable state or it was something else. Maybe increasing latencies in
> message handling triggered cluster members into flipping neighbor
> aliveness status back and forth? Just speculating here: could timeouts
> because of high load cause network fragmentation, when every node
> temporarily does not see neighbors, becomes a master, then sees a
> neighbor, freaks out, etc?

That's plausible, but I don't think that's what's happening (there's 
nothing about network partitioning in the logs).

> I've attached logs from all 3 cluster members. They are polluted with
> load balancer "ping".

Thanks. I've had a poke at this but nothing is leaping out at me yet. 
I'll keep at it though.

One thing that's a bit odd: you seem to be creating HA / transient / 
autodelete / exclusive queues. So although they're "HA", they will 
vanish if any of the following happens:

* The entire cluster goes down (transient) or
* All consumers for a queue cancel (autodelete) or
* The connection that created them closes (exclusive)

Is this intentional? It seems like an odd use of HA.
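
To make the three conditions above concrete, here is a minimal sketch in
plain Python. It does not talk to a broker; the function name
vanish_conditions and its wording are made up for illustration, and the
three parameters are assumed to correspond to the durable, auto-delete and
exclusive flags of an AMQP queue.declare. It only models the reasoning in
the list, not RabbitMQ's actual implementation.

```python
def vanish_conditions(durable, auto_delete, exclusive):
    """Return the events that would make a queue with these flags disappear.

    Hypothetical helper: it mirrors the three bullet points above and
    nothing more.
    """
    reasons = []
    if not durable:
        # A transient queue does not survive the whole cluster going down.
        reasons.append("the entire cluster goes down")
    if auto_delete:
        # An autodelete queue is removed once all its consumers have cancelled.
        reasons.append("all consumers for the queue cancel")
    if exclusive:
        # An exclusive queue is tied to the connection that declared it.
        reasons.append("the connection that created it closes")
    return reasons


# The combination discussed in this thread: transient, autodelete, exclusive.
for reason in vanish_conditions(durable=False, auto_delete=True, exclusive=True):
    print("queue vanishes when:", reason)
```

With all three flags set as in the logs, the sketch lists all three events.
A durable, non-autodelete, non-exclusive queue yields an empty list, which
is the case where mirroring is the only thing standing between the queue
and a node failure.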

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, VMware