[rabbitmq-discuss] Node Resiliency

Fri Dec 21 10:20:58 GMT 2012

Hi

On 20 Dec 2012, at 10:42, PSL 88506 wrote:

> We are using RabbitMQ in our application.
> We encountered a problem last week in production.
> 
> We have 5 servers, clustered, and no load balancer is used.
> Suddenly, when we opened the console on RabbitServer-1, it showed RabbitServer-2,3,4,5 in red (Not Running) and RabbitServer-1 as Running.
> When we opened the RabbitServer-2 console, it showed RabbitServer-2 as Running and the others in red (Not Running).
> It was the same for all 5 servers.
> 
> We understood that the cluster was broken. Please let us know if there could be any other issues. Our team has also raised a request with the network team to find out whether any network fluctuations happened.
> 

If the other nodes are showing up in red, then they're definitely inaccessible from the node whose console you're viewing. This could be down to network interruptions. I'm afraid I can't suggest what might be wrong just on the basis of red lights in the 'console'. I assume you're talking about the management web interface here?

> Hence we decided to handle node failure to ensure node resiliency.
> 
> Could you please shed some light on how to ensure node resiliency?
> 

There are a great number of factors to take into account when planning this sort of thing. The *most* important factor when using rabbit in a cluster is to ensure you don't encounter any net-split, as rabbit doesn't handle these well. I have no idea why your nodes became unable to see one another, but if you post the logs (or a subset of them) somewhere then we can take a look. If you can identify why the nodes got disconnected, that would help us figure out how to guard against it in future.
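
To make the symptom concrete, here is a rough sketch (not a rabbit tool) of how you could confirm a net-split from each node's own view of the cluster. It assumes you have collected, by hand, the list of nodes that each node reports as running, for instance from `rabbitmqctl cluster_status`. The node names and the `find_partitions` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def find_partitions(views):
    """Group nodes that report the same set of running peers.

    `views` maps a node name to the set of node names that node
    reports as running (hypothetical input, gathered by hand).
    """
    groups = defaultdict(list)
    for node, running in views.items():
        # A node always counts itself as running.
        groups[frozenset(running) | {node}].append(node)
    return sorted(sorted(members) for members in groups.values())


# The situation described above: every node sees only itself.
# The node names are made up for this example.
views = {f"rabbit@server{i}": {f"rabbit@server{i}"} for i in range(1, 6)}
partitions = find_partitions(views)
is_split = len(partitions) > 1
```

In a healthy cluster every node reports the same set, so there is a single group. More than one group means the nodes disagree about who is running, which is what you'd expect after a net-split.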

Tim