[rabbitmq-discuss] Production cluster fun, make that not so fun (production cluster setup guidance)!

Tim Watson tim at rabbitmq.com
Tue May 21 12:03:07 BST 2013

Hi Stefan,

Sorry for the delay in getting back to you...

On 14 May 2013, at 05:52, Stefan Sedich wrote:

> What is the ideal setup for our production cluster? Currently we have two ABs with two nodes in each. After setting pause_minority today and having a partition where all nodes were paused, we changed to autoheal and restarted all nodes. In the midst of this something happened, and even after rolling back to just 1 node, apps were not able to connect to it.

Sounds like the node got stuck during recovery.

> In the end my only fix was to completely trash the db/ of that one node and restart it, which got things working again (no idea what got corrupted or how, but something went bad). After this it would be good to get some guidance on how to properly configure our cluster, how many nodes would be ideal, and what recovery node to use (we seem to occasionally have a partition, even though it SHOULD be a stable link).

The pause_minority setting will cause nodes in the minority island to literally pause (with regard to cluster membership) until they've seen the cluster recover, i.e., until all the expected nodes come back online. As to "what recovery node to use", there is no answer to that, because there is no concept of a 'special' node in a rabbit cluster. Or did you rather mean what recovery *mode* to use?
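
To illustrate why every node paused in your two-plus-two setup, here is a rough sketch in Python of the minority test as I understand it from the documented behaviour: a node counts itself as being in a minority when it can see half or fewer of the cluster's nodes, itself included. This is not RabbitMQ's actual code, and the function name should_pause is made up for the example.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Hypothetical helper: True when a node lacks a strict majority.

    visible_nodes counts the node itself plus every peer it can still reach.
    An exact half is treated as a minority, so an even split pauses both sides.
    """
    return visible_nodes * 2 <= cluster_size


if __name__ == "__main__":
    # Four nodes split two and two: each side sees only 2 of 4.
    print("2 of 4 pauses:", should_pause(2, 4))
    # Three nodes split two and one: the pair keeps running, the single node pauses.
    print("2 of 3 pauses:", should_pause(2, 3))
    print("1 of 3 pauses:", should_pause(1, 3))
```

With an even split neither island holds a majority, so both sides pause. An odd number of nodes avoids that particular outcome, since any two-way split leaves exactly one side with more than half.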

Please note that whilst the autoheal features are intended to simplify the process of partition recovery, they're not a panacea, and a partition can still result in states where manual intervention is required. It would be good to understand how your current cluster topology is used by clients, to get a feel for what the impact of the recovery process is likely to be.

