[rabbitmq-discuss] RabbitMQ clustering woes

Thu Aug 30 23:04:36 BST 2012

On 30 Aug 2012, at 15:08, gerhard wrote:

> Running a rabbitmq cluster makes rabbitmq extremely unstable. I have a very simple, 2 nodes setup (both disc) for HA purposes. The most important queues have the HA flag ensuring they get replicated on both nodes so that if one goes down, the clients can reconnect to the other one and continue consuming messages as if nothing happened. In theory, it sounds perfect, but in practice it's such a pain!
> 

That shouldn't be the case, obviously. From the sounds of things, the issues you're experiencing are to do with HA rather than with clustering as such. Clustering, in and of itself, is not just an HA feature.

> If one node goes down, the performance of the remaining node gets seriously affected. Everything slows down, and on occasion the broker just crashes. Any rabbitmqctl commands on the remaining node take 30+ seconds, even if the node is relatively quiet. Resetting the cluster doesn't work without a force_reset; I keep getting the no_running_cluster_nodes error.
> 

This is not normal behaviour. In a two node cluster with HA enabled, there will be *some* performance degradation during failover, as the remaining node has a lot of work to do. That 'slow down' should not last for long, though. The broker should not crash, and rabbitmqctl should not keep taking 30+ seconds to complete commands once things have evened out a bit.

It *sounds* to me like your problems are worth investigating, but we'll need some more information to go on. 

- How often is one of the nodes failing? This (obviously) shouldn't happen very often. 
- What is causing one of the nodes to fail? Is it a rabbit crash, or something else (external to the broker)?
- What kind of load is the system (i.e., the broker) under whilst failover takes place?
- How many HA queues do you have and how many exchanges + bindings around these?
- Are the machines on the same LAN, subnet, etc.? Is there any specific network kit in between them?

If you can post the logs (and sasl logs) from both brokers and `rabbitmqctl report` output, that would also be very helpful. If you're able to provide a minimal example (written, scripted or whatever) then that would also enable us to diagnose and fix any problems more rapidly. Any other diagnostic information would also be useful (info on CPU, memory, network and disk usage on the given nodes for example).
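
One way to gather the rabbitmqctl side of the diagnostics requested above, and to record how long each command takes on the surviving node, is a small script along these lines. This is only a sketch: it assumes Python 3.5 or later is available on the broker host and that `rabbitmqctl` is on the PATH of a user allowed to run it. The function names (`run_rabbitmqctl`, `collect`, `format_results`) and the 120 second timeout are illustrative choices, not part of RabbitMQ.

```python
import subprocess
import time

# rabbitmqctl subcommands to capture. "report" is the one asked for above;
# "status" and "cluster_status" are added here as an assumption that they
# are also useful to have alongside it.
COMMANDS = ["status", "cluster_status", "report"]


def run_rabbitmqctl(subcommand, timeout=120):
    """Run one rabbitmqctl subcommand and return (exit_code, output)."""
    try:
        proc = subprocess.run(
            ["rabbitmqctl", subcommand],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        # rabbitmqctl is missing, not executable, or hung past the timeout.
        return None, str(exc)


def collect(runner=run_rabbitmqctl, commands=COMMANDS):
    """Run each command in turn, recording how long it took."""
    results = []
    for name in commands:
        started = time.monotonic()
        code, output = runner(name)
        elapsed = time.monotonic() - started
        results.append({
            "command": name,
            "exit_code": code,
            "seconds": round(elapsed, 2),
            "output": output,
        })
    return results


def format_results(results):
    """Render the results as plain text that can be pasted into an email."""
    parts = []
    for r in results:
        parts.append(
            "== rabbitmqctl {command} (exit {exit_code}, {seconds}s) ==".format(**r)
        )
        parts.append(r["output"].rstrip())
    return "\n".join(parts) + "\n"
```

Calling `print(format_results(collect()))` on each node, once while both nodes are up and once during a failover, would give timings that can be compared directly. The broker logs and sasl logs still need to be collected separately.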

> Is anyone else having the same issues with RabbitMQ clustering? As it stands, it's pretty unusable. I'm running RabbitMQ 2.8.6 on Erlang R13B03 under Ubuntu 10.04.03 64bit, installed via the official http://www.rabbitmq.com/debian/dists/testing/ packages.

We've seen a few bugs in the HA code which can crop up under very specific circumstances and we've fixed most of them, released many of the fixes and are trying to get the remaining fixes completed and released asap.

Please also note that we *highly* recommend upgrading to the latest Erlang/OTP release if possible (R15B01 at this time), as R13 is quite old and a lot of fixes and performance improvements have gone into the VM since then.

Thanks,
Tim