[rabbitmq-discuss] Fully reliable setup impossible?

Steffen Daniel Jensen steffen.daniel.jensen at gmail.com
Thu May 22 14:04:12 BST 2014


We have two data centers, closely connected by a LAN.
 
We are interested in a *reliable cluster* setup. It must be a cluster 
because we want clients to be able to connect to any node transparently. 
Federation is not an option. Our conditions are:
1. Occasionally the firewall/switch is restarted, and a few ping 
messages may be lost.
2. The setup should survive a data center crash.
3. All queues are durable and mirrored, all messages are persisted, and all 
publishes are confirmed.
 
There are three partition-handling settings (cluster_partition_handling):
a) ignore: A network breakdown between the data centers would cause message 
loss on the node that is restarted in order to rejoin.
b) pause_minority: If we choose the same number of nodes in each data 
center, the whole cluster will pause. If we don't, only the data center 
with the most nodes can survive. 
c) autoheal: If the cluster decides that a network partition has occurred, 
there is a potential for message loss when the nodes rejoin.
[I would really like a resync-setting similar to the one described below]
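
To make the node-count argument in (b) concrete, here is a minimal sketch. 
It assumes the pause_minority rule is "a side keeps running only if it holds 
a strict majority of all cluster nodes"; the function name and parameters 
are made up for illustration and are not part of RabbitMQ.

```python
def surviving_sides(side_sizes, cluster_size=None):
    """Return the indices of the partition sides that keep running.

    Assumed rule: a side continues only if it holds a strict majority of
    the cluster. cluster_size defaults to the sum of the reachable sides;
    pass it explicitly to model a data center that has crashed entirely.
    """
    total = sum(side_sizes) if cluster_size is None else cluster_size
    return [i for i, n in enumerate(side_sizes) if 2 * n > total]


# Equal node counts in both data centers: neither side has a majority.
print(surviving_sides([2, 2]))                  # []
# Unequal node counts: only the larger data center keeps running.
print(surviving_sides([3, 2]))                  # [0]
# The larger data center (3 of 5 nodes) crashes: the remaining 2 pause too.
print(surviving_sides([2], cluster_size=5))     # []
```

The last case is why (b) seems to conflict with condition 2: with two data 
centers, the smaller one cannot keep running on its own.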
 
Question 1: Is it even possible to have a fully reliable setup in such a 
setting?
 
In reality we probably won't have lasting network partitions; most likely 
there will only be very short network outages.
 
Question 2: Is it possible to adjust how long it takes rabbitmq to decide 
"node down"?
 
It is much better to have a halted rabbitmq for some seconds than to have 
message loss.
 
 
Question 3: Assume that we are using the ignore setting, and that we have 
only two nodes in the cluster. Would the following be a full recovery with 
zero message loss? 
 
0. Decide which node survives, Ns, and which should be restarted, Nr.
1. Refuse all connections to Nr except from a special recovery application. 
(One could, for example, change the IP so that the running services cannot 
connect.)
2. Consume and republish all messages from Nr to Ns.
3. Restart Nr.
Then the cluster should be up and running again.
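
Step 2 of the procedure above could be sketched roughly as follows. This is 
only an outline under assumptions: fetch, publish and ack are hypothetical 
callables that the recovery application would implement on top of its AMQP 
client (get without auto-ack on Nr, publish with confirms on Ns, ack on Nr). 
Here they are backed by in-memory lists so the sketch runs without a broker.

```python
from collections import deque


def drain_and_republish(fetch, publish, ack):
    """Move every message from Nr to Ns and return how many were moved.

    fetch()       -> (tag, body) for the next message on Nr, or None if empty
    publish(body) -> True only once Ns has confirmed the message
    ack(tag)      -> removes the message from Nr
    """
    moved = 0
    while True:
        item = fetch()
        if item is None:
            return moved
        tag, body = item
        if not publish(body):
            # Ns did not confirm: leave the message unacked on Nr and stop.
            raise RuntimeError("publish to Ns not confirmed; message kept on Nr")
        ack(tag)  # ack on Nr only after Ns has confirmed
        moved += 1


# In-memory stand-ins for the two nodes (illustration only).
nr_queue = deque(enumerate([b"m1", b"m2", b"m3"]))  # (tag, body) pairs on Nr
ns_queue = []                                       # messages confirmed by Ns
unacked = {}                                        # fetched from Nr, not yet acked


def fetch():
    if not nr_queue:
        return None
    tag, body = nr_queue.popleft()
    unacked[tag] = body
    return tag, body


def publish(body):
    ns_queue.append(body)
    return True


def ack(tag):
    del unacked[tag]


moved = drain_and_republish(fetch, publish, ack)
print(moved, ns_queue)
```

Acking on Nr only after Ns has confirmed means a crash in between could 
duplicate a message but should not lose one, so consumers would need to 
tolerate duplicates.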
 
Since all queues are mirrored, all messages published during the partition 
are preserved. If a certain service lives in only one data center, 
messages will pile up in the other (if there are any publishes).
 
If you have any other suggestions, I would be very interested to hear them.
 
I would be really sad to find it necessary to choose Tibco ESB over 
RabbitMQ for this reason.
 
Thank you,
-- Steffen