[rabbitmq-discuss] Automating a RabbitMQ 3.0 cluster on EC2

Mon Dec 24 11:31:23 GMT 2012

Hi Mathias,

At Thu, 20 Dec 2012 15:48:05 +0100,
Mathias Meyer wrote:
> I spent some quality time automating a cluster setup of 3.0.1 on EC2 over the
> last couple of days, and came across some things that felt a bit odd to
> me. Maybe I did it all wrong, happy to be proven that I did. I apologize if
> this turns out a bit long, there are a couple of questions that boil down to a
> similar issue. Would be curious to hear how other folks have solved this
> issue, and mostly focussing on RabbitMQ 3.0, as, from what I've read,
> clustering behaviour has changed with that release.
> 
> First a wee bit about the setup, nothing special really, but required to
> clarify some of the questions below: every node has a custom CNAME record
> rabbitmq1.domain.com, rabbitmq2.domain.com, each pointing to the EC2 host
> name. I do this because I prefer full hostnames over EC2 hostnames because it
> adds clarity to the whole setup, at least for me :)
> 
> The first issue with this setup comes up when you want to change the hostname
> for the RabbitMQ node. Changing it post-installation is a bit of a hassle
> because the package already starts up the service. Changing the nodename to
> the FQDN and trying to restart the service after that leads to errors because
> the service can't be stopped anymore as the nodenames are now different.
> 
> I solved this on EC2 by adding domain.com to the DHCP configuration and
> restarting the network before installing the RabbitMQ package. It's not a
> great solution, but acceptable. In the end it boils down to the package
> starting the service immediately on installation, more on that below.

Uhm, two things:

  * RabbitMQ does not like FQDNs - I think you can probably make it work by
    tweaking the startup scripts, but I am not sure.  What we recommend is not
    to use FQDNs.
  * If you change the hostname and restart RabbitMQ, things should go smoothly.

That said, I’m not sure what DHCP has to do with this.  I suppose that what you
have done makes it possible to resolve the short names of the other nodes,
since things are working.

> The next issue is related to clustering. When RabbitMQ starts up, and there's
> a cluster_nodes section in the config, it seems to try and reach the nodes
> only once. This behaviour I'm not sure of, hence the question. I noticed that
> when a node can't look up any of the nodes in the cluster_nodes config, it
> won't try again at a later point in time, e.g. on restart. Therefore that node
> will never automatically join the cluster unless rabbitmqctl join_cluster is
> called.
> 
> The DHCP configuration helped solve this issue as well, but I'm still
> wondering if a) my observation is correct and b) if this is desired
> behaviour. Should a new node be partitioned from the others only temporarily,
> when it joins, it requires manual intervention to force it to join the
> cluster. This somewhat seems to conform to what the documentation says:
> http://www.rabbitmq.com/clustering.html#auto-config, but I'm not entirely
> clear on whether a node just gives up trying to form a cluster once it
> couldn't reach any of the nodes in the cluster_nodes list.

The behaviour you describe is correct.  The `cluster_nodes' list is only
effective on “virgin” nodes, that is, nodes that are started for the first time.
If such a node can’t connect to any of the listed nodes, it won’t try again
later.  The rationale behind that comes from boring technicalities related to
how mnesia (the database that backs RabbitMQ clustering) works, and most
importantly the fact that
clustered nodes shouldn’t experience netsplits.  So if you say “Should a new
node be partitioned from the others only temporarily...”, maybe clustering is
not the right solution.

Again, I don’t see how DHCP has anything to do with this.
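
For reference, here is a minimal sketch of what a 3.0-style `cluster_nodes'
entry could look like, wrapped in a small shell function so it can be dropped
into a provisioning script.  The node names rabbit@rabbitmq1 and
rabbit@rabbitmq2 are assumptions based on the short hostnames in your setup,
and the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: print a rabbitmq.config using the 3.0-style cluster_nodes tuple,
# i.e. {cluster_nodes, {[Nodes], disc | ram}}.
# Node names are assumptions; adjust them to your own short hostnames.
render_cluster_config() {
    cat <<'EOF'
[
  {rabbit, [
    {cluster_nodes, {['rabbit@rabbitmq1', 'rabbit@rabbitmq2'], disc}}
  ]}
].
EOF
}

render_cluster_config
```

The output would have to be in place as the node's rabbitmq.config before the
node is started for the first time, since the list is only consulted on a fresh
node.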

> So the question boils down to whether the automatic configuration is the way
> to go or if it makes more sense to automate commands (using Chef, btw) around
> the join_cluster and cluster_status commands.
> 
> On top of that, to have a fresh node join the cluster, it needs to be stopped
> again (stop_app) and reset, which somewhat boils down to the service being
> started on package installation again. This behaviour seems to also have
> changed from 2.x where just updating the config and making sure all nodes have
> the same Erlang cookie was enough, right?
> 
> So the biggest question is: how do folks work around the fact that the
> RabbitMQ node is started up on package installation. There's the option to use
> policy-rc.d systems, but I'm not quite sure how feasible this is to
> automate. Given that Chef runs every 30 minutes or so, the consequence would
> be to install a policy on every run or to check on every run whether or not
> the desired package is already installed. Currently I'm stopping the service
> after stop_app and reset, before installing the new Erlang cookie. I'm just
> not sure, it feels a bit weird to automate to me. I'd love for some input or
> suggestions on this.
> 
> My current setup is working, and I can add nodes that automatically join the
> cluster, so it's okay. Just want to make sure it's the right approach, or if
> there are any other experiences on this. From what I understand there are
> differences to 2.x cluster setup, where updating the config and changing the
> Erlang cookie apparently were all that's needed, but that's from my reading of
> the documentation and existing Chef cookbooks
> (https://github.com/opscode-cookbooks/rabbitmq). Overall I feel like there's a
> bit of a lack of documentation on how setting up a cluster can or should be
> automated. Happy to help improve that situation, but I'd like to be sure
> whether the choices described above are sane or complete bollocks.
> 
> Again, apologies for the long email. I hope the setup, issues and questions
> are somewhat clear. Please let me know if more input is required, happy to dive
> into more detail.
> 
> Thank you for your time, for RabbitMQ clustering, and for any feedback you
> might have :)

I don’t know anything about EC2 or Chef, and I don’t understand what you are
doing with DHCP, but I’m quite sure you’re mixing up two separate issues: one
is related to FQDNs and how nodes resolve other node names, and the other is
related to how the automatic clustering configuration works.

As for the former, RabbitMQ nodes will use unqualified names (unless you are
tweaking how RabbitMQ nodes are started up), so you’ll have to work with
that.  Obviously you’ll have to make the unqualified names resolve correctly on
each node, e.g. by adding entries in the hosts file.
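
As a rough illustration of the hosts file approach, something along these lines
might do.  The addresses are hypothetical private IPs, not taken from your
setup, and the function name is made up:

```shell
#!/bin/sh
# Sketch: emit hosts-file lines mapping short node hostnames to addresses.
# The 10.0.0.x addresses are placeholders; use each instance's real address.
hosts_entries() {
    # printf reuses the format for each address/name pair.
    printf '%s\t%s\n' \
        10.0.0.11 rabbitmq1 \
        10.0.0.12 rabbitmq2
}

hosts_entries
# To apply on a node (as root): hosts_entries >> /etc/hosts
```

Every node needs an entry for every other node, so that a name such as
rabbit@rabbitmq2 can be reached from rabbitmq1 and vice versa.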

As for the latter, if you are confident that netsplits won’t occur, then the
semantics of the `cluster_nodes' configuration should be fine.  If that is not
the case you might need to hack up some other mechanism, but I would advise
against that because:

  1. It’s easy to get wrong: clustering is quite delicate and there are many
     subtleties.
  2. If you are on a network that is affected by netsplits, you should avoid
     clustering anyway - see <http://www.rabbitmq.com/clustering.html>.

And yes, things changed from 2.x on this front as part of efforts to make
clustering more solid.
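
If you do end up scripting the join by hand rather than relying on
`cluster_nodes', a minimal sketch of the sequence could look like the
following.  It only uses the stop_app, join_cluster, start_app and
cluster_status commands; the SEED_NODE value, the RABBITMQCTL override and the
function name are assumptions for illustration, and the "already joined" check
is a simple text match rather than anything robust:

```shell
#!/bin/sh
# Sketch of a join step that is safe to repeat on every provisioning run.
# SEED_NODE is an assumed node name; RABBITMQCTL can be overridden for testing.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"
SEED_NODE="${SEED_NODE:-rabbit@rabbitmq1}"

join_cluster_if_needed() {
    # If cluster_status already mentions the seed node, assume we have joined.
    if $RABBITMQCTL cluster_status 2>/dev/null | grep -qF "$SEED_NODE"; then
        echo "already clustered with $SEED_NODE"
        return 0
    fi
    # The application must be stopped before the node can join a cluster.
    $RABBITMQCTL stop_app &&
    $RABBITMQCTL join_cluster "$SEED_NODE" &&
    $RABBITMQCTL start_app
}
```

Calling join_cluster_if_needed on a node other than the seed would then join it
once and do nothing on later runs.  The caveats above still apply: this does
not make clustering any more tolerant of netsplits.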

I hope this helps.

Francesco