[rabbitmq-discuss] Fully reliable setup impossible?

Jason McIntosh mcintoshj at gmail.com
Thu May 22 17:16:58 BST 2014


On your clustering and clients connecting transparently - I highly
recommend a load balancer in front of your Rabbit servers.  With TCP
half-open monitoring on the rabbit port, you can tell pretty quickly when
a node/site goes down, and then fail over to one of the other
nodes/clusters.  With clustering, mirrored queues, and publisher confirms
you'll avoid data loss this way.  You CAN get data duplication though.
But I'd only recommend clustering over a really reliable link.  If you're
going across a WAN, use the Shovel or Federation plugins to replicate
messages to Rabbit clusters on the other side rather than trying to run a
cross-WAN cluster.  You could, for example, run a cluster in each site and
use federation to send messages to the other cluster as needed.  If any
given site goes down, your load balancer could switch traffic to the other
cluster.  There's still a chance of downtime, but it's pretty minimal.  We
use this to redirect traffic to any given node in the cluster right now,
so if a single node fails, the load balancers pull that node out of
service automatically.
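
To make the client side of that concrete, here's a minimal sketch using
Python and the pika client (the host names and the fallback list are just
hypothetical placeholders for your own LB VIP and nodes): try the load
balancer address first, then fall back to individual nodes.

import pika

# Hypothetical host names - substitute your own LB VIP and node addresses.
CANDIDATE_HOSTS = ["rabbit-lb.example.com",    # load balancer in front
                   "rabbit-node1.example.com", # direct node fallbacks
                   "rabbit-node2.example.com"]

def connect(hosts=CANDIDATE_HOSTS):
    """Return a connection to the first host that answers."""
    last_error = None
    for host in hosts:
        try:
            params = pika.ConnectionParameters(host=host, heartbeat=30)
            return pika.BlockingConnection(params)
        except pika.exceptions.AMQPConnectionError as exc:
            last_error = exc   # node or site down - try the next one
    raise last_error

connection = connect()
channel = connection.channel()

Normally the load balancer hides all of this, but the fallback list covers
the window where the LB itself hasn't noticed the dead node yet.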

Regarding question 2 - if you design it right, i.e. use publisher confirms
(the default in most clients, as I understand it) and persistent messages,
you'll never get message loss with a mirrored queue unless ALL servers
completely crash and the hard drives die.  At least this is my
understanding :)
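
As a rough illustration (again Python/pika; the queue name and policy
pattern are made up for the example), the publish side of that looks like
this - durable queue, mirroring applied as a broker-side policy, confirms
enabled on the channel, and each message marked persistent:

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters("rabbit-lb.example.com"))  # hypothetical host
channel = connection.channel()

# Durable queue.  Mirroring itself comes from a broker-side policy, e.g.:
#   rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'
channel.queue_declare(queue="ha.orders", durable=True)

# Publisher confirms: with pika 1.x, basic_publish will raise if the broker
# never acks the message, so the publisher knows to retry or log it.
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="ha.orders",
    body=b"important payload",
    properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent
    mandatory=True)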

Last point - you may want to handle this situation manually if it's that
much of a concern.  E.g. leave the nodes partitioned, let all the queues
drain (again, remember duplicates are possible), then restrict access to
the bad nodes so that nothing but your consumer processes can reach them,
shut down the "bad" nodes, and bring them back up.  They'd not have any
messages, and they'd get their queues/exchanges from your "master" node
that was good when they come back up.  With a load balancer in front, you
could use the load balancer to control this very effectively.
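
If you'd rather move the messages yourself instead of waiting for your
normal consumers (roughly step 2 of your question 3), a drain-and-republish
pass could look like the sketch below - again Python/pika with hypothetical
host and queue names, and it only acks the source after the survivor has
confirmed the copy.  Duplicates are still possible if the script dies
between publish and ack.

import pika

# Hypothetical addresses: Nr = the node being drained, Ns = the survivor.
bad  = pika.BlockingConnection(pika.ConnectionParameters("rabbit-nr.example.com"))
good = pika.BlockingConnection(pika.ConnectionParameters("rabbit-ns.example.com"))
src, dst = bad.channel(), good.channel()
dst.confirm_delivery()   # don't ack the source until the survivor has the copy

while True:
    method, properties, body = src.basic_get(queue="ha.orders", auto_ack=False)
    if method is None:               # the queue on the drained node is empty
        break
    dst.basic_publish(exchange="", routing_key="ha.orders",
                      body=body, properties=properties)
    src.basic_ack(method.delivery_tag)   # safe to remove from the drained node

bad.close()
good.close()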

Definitely read through the partitioning and reliability documentation and
actually try these scenarios:
https://www.rabbitmq.com/partitions.html
https://www.rabbitmq.com/reliability.html


Jason




On Thu, May 22, 2014 at 8:04 AM, Steffen Daniel Jensen <
steffen.daniel.jensen at gmail.com> wrote:

> We have two data centers connected closely by LAN.
>
> We are interested in a *reliable cluster* setup. It must be a cluster
> because we want clients to be able to connect to each node transparently.
> Federation is not an option.
> 1. It happens that the firewall/switch is restarted, and maybe a few ping
> messages are lost.
> 2. The setup should survive data center crash
> 3. All queues are durable and mirrored, all messages are persisted, all
> publishes are confirmed
>
> There are 3 cluster-recovery settings
> a) ignore: A cross-data-center network breakdown would cause message
> loss on the node that is restarted in order to rejoin.
> b) pause_minority: If we choose the same number of nodes in each data
> center, the whole cluster will pause. If we don't, only the data center
> with the most nodes can survive.
> c) autoheal: If the cluster detects a network partition, there is a
> potential for message loss when rejoining.
> [I would really like a resync-setting similar to the one described below]
>
> Question 1: Is it even possible to have a fully reliable setup in such a
> setting?
>
> In reality we probably won't have actual network partitions, and it will
> most probably only be a very short network downtime.
>
> Question 2: Is it possible to adjust how long it takes RabbitMQ to decide
> "node down"?
>
> It is much better to have a halted rabbitmq for some seconds than to have
> message loss.
>
>
> Question 3: Assume that we are using the ignore setting, and that we have
> only two nodes in the cluster. Would the following be a full recovery with
> zero message loss?
>
> 0. Decide which node survives, Ns, and which should be restarted, Nr.
> 1. Refuse all connections to Nr except from a special recovery
> application. (One could change the IP so that the running services can't
> connect, or similar.)
> 2. Consume and republish all messages from Nr to Ns.
> 3. Restart Nr
> Then the cluster should be up-and-running again.
>
> Since all queues are mirrored, all messages published during the
> partition are preserved. If a certain service lives only in the one data
> center, messages will pile up in the other (if there are any publishes).
>
> If you have any other suggestions, I would be very interested to hear them.
>
> I would be really sad to find it necessary to choose Tibco ESB over
> RabbitMQ, for this reason.
>
> Thank you,
> -- Steffen
>


-- 
Jason McIntosh
https://github.com/jasonmcintosh/
573-424-7612

