[rabbitmq-discuss] Automating a RabbitMQ 3.0 cluster on EC2

Mathias Meyer meyer at paperplanes.de
Fri Dec 28 11:26:52 GMT 2012


Francesco,

Thanks for the answers, I've added some clarification below.

On Monday, 24. December 2012 at 12:31, Francesco Mazzoli wrote:
>  
> Uhm, two things:
>  
> * RabbitMQ does not like FQDNs - I think you can probably make it work by
> tweaking the startup scripts, but I am not sure. What we recommend is not
> to use FQDNs.
> * If you change the hostname and restart RabbitMQ, things should go smoothly.
>  
> That said, I’m not sure what DHCP has to do with this. I suppose that what you
> have done is to make it possible to resolve the short names of the other nodes,
> since things are working.
>  
Sorry I wasn't clearer on this. DHCP comes in as my way of updating either /etc/hosts or /etc/resolv.conf. EC2 instances get their network settings via DHCP, and the DHCP client is responsible for writing files like /etc/resolv.conf, so I used that mechanism to add our own domain to the search list.
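
For illustration, the effect on /etc/resolv.conf can be sketched in Python as below. This is a minimal sketch, not the actual DHCP client hook: the function name add_search_domain and the sample file contents are assumptions, and it only transforms text rather than writing to the real file.

```python
def add_search_domain(resolv_conf: str, domain: str) -> str:
    """Return resolv.conf text with `domain` present in the search list.

    Hypothetical helper: it works on a string so it can be tried out
    without touching /etc/resolv.conf.
    """
    out = []
    seen_search = False
    for line in resolv_conf.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            seen_search = True
            domains = fields[1:]
            # Append our domain only once, so repeated runs are harmless.
            if domain not in domains:
                domains.append(domain)
            line = "search " + " ".join(domains)
        out.append(line)
    if not seen_search:
        # No search line yet: add one with just our domain.
        out.append("search " + domain)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample contents are made up for the example.
    sample = "search ec2.internal\nnameserver 10.0.0.2\n"
    print(add_search_domain(sample, "domain.com"), end="")
```

With domain.com in the search list, a short name such as rabbitmq1 resolves through the rabbitmq1.domain.com record, which is what lets the nodes keep their unqualified names.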

Good to know about the FQDN, I'll leave things as they are then.

Regarding changing the hostname: I tried that, and in my experiments it didn't seem to work properly. That might again have been caused by the hostname temporarily not being resolvable.  
>  
> The behaviour you describe is correct. The `cluster_nodes' list is only
> effective on “virgin” nodes, that is nodes that are started for the first time.
> If the node can’t connect to any other node it won’t do anything. The rationale
> behind that comes from boring technicalities related to how mnesia (the database
> that backs RabbitMQ clustering) works, and most importantly the fact that
> clustered nodes shouldn’t experience netsplits. So if you say “Should a new
> node be partitioned from the others only temporarily...”, maybe clustering is
> not the right solution.
>  
The partitioning was mostly conjecture. It might happen, and it's still fixable if it does. The whole mechanism just strikes me as less than ideal when it comes to automating a cluster setup.  
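
Since a node consults cluster_nodes only on its first start, one alternative is to have the automation tool drive the join itself and retry when the other nodes are not reachable yet. The following is a minimal Python sketch under stated assumptions: it uses the rabbitmqctl subcommands stop_app, reset, join_cluster and start_app, while the function name join_fresh_node, the injectable run and sleep parameters, the retry counts, and the node names are hypothetical. Note that reset wipes the node's state, so this is only meant for fresh nodes.

```python
import subprocess
import time
from typing import Callable, Sequence


def join_fresh_node(
    seeds: Sequence[str],
    run: Callable[[Sequence[str]], int] = subprocess.call,
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to join a fresh node to a cluster via one of the seed nodes.

    `run` executes a command and returns its exit code; it is injectable
    so the logic can be exercised without a RabbitMQ installation.
    Returns True only if the join succeeded and the app was restarted.
    """
    def ctl(*args: str) -> list:
        return ["rabbitmqctl", *args]

    if run(ctl("stop_app")) != 0:
        return False

    joined = False
    # reset discards local state, so only do this on a fresh node.
    if run(ctl("reset")) == 0:
        for attempt in range(attempts):
            for seed in seeds:
                if run(ctl("join_cluster", seed)) == 0:
                    joined = True
                    break
            if joined:
                break
            if attempt + 1 < attempts:
                # Seeds may be temporarily unreachable; wait and retry.
                sleep(delay)

    # Start the app again either way so the node is not left stopped.
    started = run(ctl("start_app")) == 0
    return joined and started
```

A call such as join_fresh_node(["rabbit@rabbitmq1", "rabbit@rabbitmq2"]) would run the real commands; passing a fake run function makes it possible to check the command sequence without touching a broker.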
>  
> Again, I don’t see how DHCP has anything to do with this.
>  
Same as above :)
> I don’t know anything about EC2 or Chef, and I don’t understand what you are
> doing with DHCP, but I’m quite sure you’re mixing up two separate issues: one
> is related to FQDNs and how nodes resolve other node names, and the other is
> related to how the automatic clustering configuration works.
>  
They might seem different on the surface, but given how the RabbitMQ package installer works, they both converge on the issue of automating and customizing a cluster setup.  
>  
> For what concerns the former, RabbitMQ nodes will use unqualified names (unless
> you are tweaking how RabbitMQ nodes are started up), so you’ll have to play by
> that. Obviously you’ll have to make the unqualified names resolve correctly on
> each node, e.g. by adding entries in the hosts file.
>  
> For what concerns the latter, if you are confident that netsplits won’t occur,
> then the semantics of the `cluster_nodes' configuration should be fine. If that
> is not the case you might need to hack up some other mechanism but I would
> advise against that because
>  
> 1. It’s easy to get wrong, clustering is quite delicate and there are many
> subtleties.

Are all of them documented? Looking at the documentation, it seems to be mostly straightforward, except for the netsplits. If there are that many subtleties, I sure wish they were mentioned properly in the documentation.
> 2. If you are on a network that is affected by netsplits, you should avoid
> clustering anyway - see <http://www.rabbitmq.com/clustering.html>.
>  
The cluster is going to run with all nodes in a single availability zone. Netsplits are unavoidable in any network but are less likely to happen in a setup like that.

My biggest question still remains unanswered unfortunately: how do folks go about automating a cluster setup, with or without the issues described in my original email?

Any input on that particular topic would be much appreciated.

Thanks!

Cheers, Mathias  


On Monday, 24. December 2012 at 12:31, Francesco Mazzoli wrote:

>  
>  
> Hi Mathias,
>  
> At Thu, 20 Dec 2012 15:48:05 +0100,
> Mathias Meyer wrote:
> > I spent some quality time automating a cluster setup of 3.0.1 on EC2 over the
> > last couple of days, and came across some things that felt a bit odd to
> > me. Maybe I did it all wrong, happy to be proven that I did. I apologize if
> > this turns out a bit long, there are a couple of questions that boil down to a
> > similar issue. Would be curious to hear how other folks have solved this
> > issue, and mostly focussing on RabbitMQ 3.0, as, from what I've read,
> > clustering behaviour has changed with that release.
> >  
> > First a wee bit about the setup, nothing special really, but required to
> > clarify some of the questions below: every node has a custom CNAME record
> > (rabbitmq1.domain.com, rabbitmq2.domain.com), each pointing to the EC2 host
> > name. I do this because I prefer full hostnames over EC2 hostnames because it
> > adds clarity to the whole setup, at least for me :)
> >  
> > The first issue with this setup comes up when you want to change the hostname
> > for the RabbitMQ node. Changing it post-installation is a bit of a hassle
> > because the package already starts up the service. Changing the nodename to
> > the FQDN and trying to restart the service after that leads to errors because
> > the service can't be stopped anymore as the nodenames are now different.
> >  
> > I solved this on EC2 by adding domain.com to the DHCP configuration and
> > restarting the network before installing the RabbitMQ package. It's not a
> > great solution, but acceptable. In the end it boils down to the package
> > starting the service immediately on installation, more on that below.
> >  
>  
>  
> Uhm, two things:
>  
> * RabbitMQ does not like FQDNs - I think you can probably make it work by
> tweaking the startup scripts, but I am not sure. What we recommend is not
> to use FQDNs.
> * If you change the hostname and restart RabbitMQ, things should go smoothly.
>  
> That said, I’m not sure what DHCP has to do with this. I suppose that what you
> have done is to make it possible to resolve the short names of the other nodes,
> since things are working.
>  
> > The next issue is related to clustering. When RabbitMQ starts up, and there's
> > a cluster_nodes section in the config, it seems to try and reach the nodes
> > only once. This behaviour I'm not sure of, hence the question. I noticed that
> > when a node can't look up any of the nodes in the cluster_nodes config, it
> > won't try again at a later point in time, e.g. on restart. Therefore that node
> > will never automatically join the cluster unless rabbitmqctl join_cluster is
> > called.
> >  
> > The DHCP configuration helped solve this issue as well, but I'm still
> > wondering if a) my observation is correct and b) if this is desired
> > behaviour. Should a new node be partitioned from the others only temporarily,
> > when it joins, it requires manual intervention to force it to join the
> > cluster. This somewhat seems to conform to what the documentation says:
> > http://www.rabbitmq.com/clustering.html#auto-config, but I'm not entirely
> > clear on whether a node just gives up trying to form a cluster once it
> > couldn't reach any of the nodes in the cluster_nodes list.
> >  
>  
>  
> The behaviour you describe is correct. The `cluster_nodes' list is only
> effective on “virgin” nodes, that is nodes that are started for the first time.
> If the node can’t connect to any other node it won’t do anything. The rationale
> behind that comes from boring technicalities related to how mnesia (the database
> that backs RabbitMQ clustering) works, and most importantly the fact that
> clustered nodes shouldn’t experience netsplits. So if you say “Should a new
> node be partitioned from the others only temporarily...”, maybe clustering is
> not the right solution.
>  
> Again, I don’t see how DHCP has anything to do with this.
>  
> > So the question boils down to whether the automatic configuration is the way
> > to go or if it makes more sense to automate commands (using Chef, btw) around
> > the join_cluster and cluster_status commands.
> >  
> > On top of that, to have a fresh node join the cluster, it needs to be stopped
> > again (stop_app) and reset, which somewhat boils down to the service being
> > started on package installation again. This behaviour seems to also have
> > changed from 2.x, where just updating the config and making sure all nodes have
> > the same Erlang cookie was enough, right?
> >  
> > So the biggest question is: how do folks work around the fact that the
> > RabbitMQ node is started up on package installation. There's the option to use
> > policy-rc.d systems, but I'm not quite sure how feasible this is to
> > automate. Given that Chef runs every 30 minutes or so, the consequence would
> > be to install a policy on every run or to check on every run whether or not
> > the desired package is already installed. Currently I'm stopping the service
> > after stop_app and reset, before installing the new Erlang cookie. I'm just
> > not sure, it feels a bit weird to automate to me. I'd love for some input or
> > suggestions on this.
> >  
> > My current setup is working, and I can add nodes that automatically join the
> > cluster, so it's okay. Just want to make sure it's the right approach, or if
> > there are any other experiences on this. From what I understand there are
> > differences to 2.x cluster setup, where updating the config and changing the
> > Erlang cookie apparently were all that's needed, but that's from my reading of
> > the documentation and existing Chef cookbooks
> > (https://github.com/opscode-cookbooks/rabbitmq). Overall I feel like there's a
> > bit of a lack of documentation on how setting up a cluster can or should be
> > automated. Happy to help improve that situation, but I'd like to be sure whether
> > the choices described above are sane or complete bollocks.
> >  
> > Again, apologies for the long email. I hope the setup, issues and questions
> > are somewhat clear. Please let me know if more input is required, happy to dive
> > into more detail.
> >  
> > Thank you for your time, for RabbitMQ clustering, and for any feedback you
> > might have :)
> >  
>  
>  
> I don’t know anything about EC2 or Chef, and I don’t understand what you are
> doing with DHCP, but I’m quite sure you’re mixing up two separate issues: one
> is related to FQDNs and how nodes resolve other node names, and the other is
> related to how the automatic clustering configuration works.
>  
> For what concerns the former, RabbitMQ nodes will use unqualified names (unless
> you are tweaking how RabbitMQ nodes are started up), so you’ll have to play by
> that. Obviously you’ll have to make the unqualified names resolve correctly on
> each node, e.g. by adding entries in the hosts file.
>  
> For what concerns the latter, if you are confident that netsplits won’t occur,
> then the semantics of the `cluster_nodes' configuration should be fine. If that
> is not the case you might need to hack up some other mechanism but I would
> advise against that because
>  
> 1. It’s easy to get wrong, clustering is quite delicate and there are many
> subtleties.
> 2. If you are on a network that is affected by netsplits, you should avoid
> clustering anyway - see <http://www.rabbitmq.com/clustering.html>.
>  
> And yes, things changed from 2.x on this front as part of efforts to make
> clustering more solid.
>  
> I hope this helps.
>  
> Francesco  
