[rabbitmq-discuss] Fully reliable setup impossible?

Simon MacMullen simon at rabbitmq.com
Thu May 22 17:33:20 BST 2014


On 22/05/14 14:04, Steffen Daniel Jensen wrote:
> We have two data centers connected closely by LAN.
> We are interested in a *reliable cluster* setup. It must be a cluster
> because we want clients to be able to connect to each node
> transparently. Federation is not an option.

I hope you realise that you are asking for a lot here! You should read 
up on the CAP theorem if you have not already done so.

> 1. It happens that the firewall/switch is restarted, and maybe a few
> ping messages are lost.
> 2. The setup should survive data center crash
> 3. All queues are durable and mirrored, all messages are persisted, all
> publishes are confirmed
> There are 3 cluster-recovery settings
> a) ignore: A cross data center network break-down would cause message
> loss on the node that is restarted in order to rejoin.
> b) pause_minority: If we choose the same number of nodes in each data
> center, the whole cluster will pause. If we don't, only the data center
> with the most nodes can survive.
> c) autoheal: If the cluster detects a network partition, there is a
> potential for message loss when the nodes rejoin.
> [I would really like a resync-setting similar to the one described below]
> Question 1: Is it even possible to have a fully reliable setup in such a
> setting?

Depends how you define "fully reliable". If you want Consistency (i.e. 
mirrored queues), Availability (i.e. neither data centre pauses) and 
Partition tolerance (no loss of data from either side if the network 
goes down between them) then I'm afraid you can't.
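
To make the pause_minority trade-off concrete, here is a minimal sketch in
Python. It assumes the documented rule that a node pauses when its side of
the partition holds half or fewer of the cluster's nodes. The function name
and the "dc1"/"dc2" labels are made up for illustration.

```python
def pause_minority_survivors(nodes_per_side):
    """Return the sides that keep running under pause_minority.

    nodes_per_side maps a side name to the number of cluster nodes on it.
    A side keeps running only if it holds a strict majority of all nodes.
    """
    total = sum(nodes_per_side.values())
    return [side for side, n in nodes_per_side.items() if 2 * n > total]


print(pause_minority_survivors({"dc1": 2, "dc2": 2}))  # []
print(pause_minority_survivors({"dc1": 3, "dc2": 2}))  # ['dc1']
```

With equal node counts neither side has a majority, so the whole cluster
pauses. With unequal counts only the larger data centre stays up, so losing
that data centre also stops the smaller one.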

> In reality we probably won't have actual network partitions, and it will
> most probably only be a very short network downtime.
> Question 2: Is it possible to adjust how long it takes rabbitmq to
> decide "node down"?

Yes, see http://www.rabbitmq.com/nettick.html
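
The setting in question is the Erlang kernel parameter net_ticktime, which
defaults to 60 seconds. As I understand the Erlang documentation, an
unresponsive node is declared down somewhere between 75% and 125% of that
value. A small sketch of the arithmetic (the function name is hypothetical):

```python
def detection_window(net_ticktime=60.0):
    """Return (min, max) seconds before an unresponsive node is declared down."""
    return (0.75 * net_ticktime, 1.25 * net_ticktime)


print(detection_window())    # (45.0, 75.0)
print(detection_window(20))  # (15.0, 25.0)
```

A larger value rides out short network blips at the cost of reacting more
slowly when a node really is down.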

> It is much better to have a halted rabbitmq for some seconds than to
> have message loss.
> Question 3: Assume that we are using the ignore setting, and that we
> have only two nodes in the cluster. Would the following be a full
> recovery with zero message loss?
> 0. Decide which node survives, Ns, and which should be restarted, Nr.
> 1. Refuse all connections to Nr except from a special recovery
> application. (One could change the ip, so all running services can't
> connect or similar)
> 2. Consume and republish all messages from Nr to Ns.
> 3. Restart Nr
> Then the cluster should be up-and-running again.

That sounds like it would work. You're losing some availability and 
consistency, and your message ordering will change. You have a pretty 
good chance of duplicating lots of messages too (any that were in the 
queues when the partition happened). Assuming you're happy with that it 
sounds reasonable.
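
The drain-and-republish step could be sketched as below. This is an
illustration only: InMemoryQueue is a hypothetical stand-in for a queue on
each node, and a real recovery tool would use an AMQP client with publisher
confirms instead. The part that matters is the ordering: ack on Nr only
after Ns has confirmed the publish.

```python
from collections import deque


class InMemoryQueue:
    """Stand-in for a queue on one node; a real tool would use an AMQP client."""

    def __init__(self, messages=()):
        self.ready = deque(messages)
        self.unacked = {}
        self._next_tag = 0

    def get(self):
        """Return (tag, body), or None when the queue is empty."""
        if not self.ready:
            return None
        self._next_tag += 1
        body = self.ready.popleft()
        self.unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def publish_confirmed(self, body):
        """Enqueue body and report whether the 'broker' confirmed it."""
        self.ready.append(body)
        return True


def drain_and_republish(source, target):
    """Move every message from source (Nr) to target (Ns).

    A message is acked on the source only after the target confirms it,
    so a crash mid-transfer duplicates rather than loses a message.
    """
    moved = 0
    while (item := source.get()) is not None:
        tag, body = item
        if not target.publish_confirmed(body):
            raise RuntimeError("publish not confirmed; message left unacked on Nr")
        source.ack(tag)
        moved += 1
    return moved


nr = InMemoryQueue(["m3", "m4"])        # diverged copy on the node to restart
ns = InMemoryQueue(["m1", "m2", "m3"])  # copy on the surviving node
print(drain_and_republish(nr, ns))      # 2
print(list(ns.ready))                   # ['m1', 'm2', 'm3', 'm3', 'm4']
```

Note that m3, which was in the queue on both sides when the partition
happened, ends up on Ns twice, and the republished messages land behind
whatever Ns already held.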

Cheers, Simon

-- 
Simon MacMullen
RabbitMQ, Pivotal

