[rabbitmq-discuss] Broken 3 node cluster

Mon Mar 17 16:58:17 GMT 2014

Hi,

We experienced almost exactly the same symptoms when we first started
working with RabbitMQ clusters. We checked everything - network, switches,
load balancer, application code. The cluster was always unstable in the way
that you describe. We brought Pivotal Labs in to do some deep dives to see
if they could help - from what they told us we weren't doing anything
incorrectly. We tried various combinations of RabbitMQ releases and Erlang
releases as well.

We put together a Linux-based cluster and never had a problem with it.
Solid as a rock and significantly more throughput. We put our Windows
cluster in a non-load balanced, two node active/passive type of
configuration. Most of our queues are mirrored/highly available and there
are a couple hundred of them. We needed to be assured that we wouldn't lose
messages, so we came up with this "limp mode" to get us into production at
low message volumes until we could solidify our ops teams around Linux. We
are now replacing the Windows "clusters" with real 3-node, load balanced
Linux clusters. All of our performance testing has shown them to be
completely solid. No weird cluster node states or anything like that.

This is not a "Linux is better than Windows" diatribe. We have been a
completely Windows shop until now and we decided to incur the cost of
operationalizing Linux in our infrastructure because RabbitMQ is a key
component. Our measurements and observations told us that if we want to run
a high-throughput, stable Rabbit cluster with mostly mirrored queues, it
would have to be on Linux.

-Ron

On Mon, Mar 17, 2014 at 4:50 AM, Patrick Long <pat at munkiisoft.com> wrote:

> RABBITMQ 3.2.1 on Windows Server 2003
>
>
> Came into work this morning to find a suspected Network partition on a 3
> node cluster
>
> Node 3 and Node 2 said Node 1 was down
>
> Node 1 said 2 and 3 were down
>
> Tried stop_app on Node 1 but it hung. stop_app on Nodes 2 and 3 was fine.
>
> All 3 nodes hang on start_app
>
> Tried restarting Windows service. Nodes 2 and 3 come back and are clustered
>
> Node 1 will not start. In the end I removed all contents of the db
> directory. Now it starts up.
>
> I want to rejoin the cluster but it says it is already a member although
> cluster_status says otherwise.
>
> I have tried forget_cluster_node from one of the running nodes but that
> hangs
>
> Anyone any ideas?
>
>
> Thanks
>
>
>
>
> --
> Patrick Long - Munkiisoft Ltd
>