[rabbitmq-discuss] RabbitMQ and Clustering on Windows
Ron.Cordell at RelayHealth.com
Fri Oct 11 00:30:18 BST 2013
We've been trying to run RabbitMQ as a cluster on Windows Server 2008 R2 for some time now and are having stability issues – namely the cluster will partition spontaneously.
The cluster originally consisted of 5 RabbitMQ "nodes", each running on a separate VMware ESX server.
Each instance is separated from another instance by at most two switch hops and no routers in the same cage in the data center. So while each instance is on a separate ESX physical host and in separate racks/power/etc, the network is such that there is no routing between Rabbit cluster nodes and all nodes are in the same VLAN. The racks are physically next to each other.
The cluster itself is behind an F5 load balancer which provides a VIP to the application servers. The F5 round-robins across the Rabbit nodes in the cluster. The F5 attempts to connect to each node in the cluster to determine whether the node is up, and adds or removes it from the load balancing pool automatically.
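For anyone who wants to reproduce that kind of connect check without an F5, here is a rough sketch in Python. It is not the F5's actual monitor, only an approximation of a plain TCP connect probe; the function names, the default AMQP port 5672 and the 2 second timeout are assumptions made for illustration.

```python
import socket


def node_is_up(host, port=5672, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 5672 (the default AMQP port) and the timeout are assumptions;
    the real F5 monitor settings are not described in this post.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def pool_members(nodes, port=5672, timeout=2.0):
    """Return the subset of nodes that currently accept TCP connections."""
    return [n for n in nodes if node_is_up(n, port, timeout)]
```

Calling pool_members() with the list of node host names gives the nodes that such a probe would keep in the pool. Note that a check like this only shows that the listener accepts connections; it says nothing about whether the node considers itself partitioned from its peers.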
All of the queues on this cluster are on the same vHost, and most of the queues are mirrored/HA queues.
We are running RabbitMQ 3.1.5 on Erlang R16B01, 64-bit Windows.
In every environment (PERFORMANCE, STAGE, INTEGRATION and PRODUCTION) we see the cluster partition. It can happen anywhere from 15 minutes to a number of days after starting the cluster.
Network latency is sub-millisecond between cluster nodes.
We have dropped the cluster size down to 3, then 2.
We have moved all nodes in the cluster onto the same VMware ESX server, which removes all of the physical networking.
We have upgraded from ESX 4.5.x to ESX 5.1.
We've tried every configuration we can think of but still see the cluster partition after some time.
One environment has been up and running for 18 days and another for 12 days; the rest have lasted 2-3 days at most.
Network monitors show no disruptions or latency in network traffic.
It doesn't matter if the application is running against the cluster or not.
And yet we continue to experience these network partitions.
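To make it easier to catch the moment a partition appears, the text output of "rabbitmqctl cluster_status" can be polled and scanned for the partitions entry. The following is a minimal sketch, assuming the output contains an Erlang term of the form {partitions,[...]} with an empty list when the node sees no partition; the function name and return values are hypothetical.

```python
import re


def partition_state(status_text):
    """Classify the text output of `rabbitmqctl cluster_status`.

    Assumes the output includes a `{partitions,[...]}` term and that an
    empty list means no partition is known to the node.
    Returns 'none', 'partitioned' or 'unknown'.
    """
    # Match the start of the partitions term; the optional group only
    # matches when the list closes immediately, i.e. it is empty.
    m = re.search(r"\{partitions,\s*\[(\s*\])?", status_text)
    if m is None:
        return "unknown"  # no partitions entry found in the output
    return "none" if m.group(1) else "partitioned"
```

Running this against the captured output on each node at a regular interval, and logging the timestamp whenever the result changes, would at least give a time to correlate with the network monitors.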
Unfortunately I'm not able to test on a set of Linux machines in this environment.
I've sent logs in previous posts and Simon has said that we have to fix our network partition problem, but as far as we can tell we don't have one.
My questions are:
* Is this expected behavior?
* Is this an Erlang on Windows problem? Is anyone running an HA cluster of similar configuration successfully on the equivalent Rabbit/Erlang versions?
* Is there anything that I should be checking that we might have missed?
All ideas are welcome!