<div>
<div>Francesco,</div><div><br></div><div>Thanks for the answers, I've added some clarification below.</div><div><br></div><div><span style="color: rgb(160, 160, 168); ">On Monday, 24. December 2012 at 12:31, Francesco Mazzoli wrote:</span></div><blockquote type="cite"><div><br></div><div>Uhm, two things:</div><div><br></div><div>* RabbitMQ does not like FQDNs - I think you can probably make it work by</div><div>tweaking the startup scripts, but I am not sure. What we recommend is not</div><div>to use FQDNs.</div><div>* If you change the hostname and restart RabbitMQ, things should go smoothly.</div><div><br></div><div>That said, I’m not sure what DHCP has to do with this. I suppose that what you</div><div>have done is to make possible to resolve the short names of the other nodes,</div><div>since things are working.</div><div><br></div></blockquote><div>Sorry that I wasn't more clear on this. The reference to DHCP is my way of updating either the /etc/hosts file or /etc/resolv.conf. EC2 instances use DHCP to get their network settings, and the DHCP client is responsible for writing things like /etc/resolv.conf, so I used that mechanism to mix in our own domain to search in.</div><div><br></div><div>Good to know about the FQDN, I'll leave things as they are then.</div><div><br></div><div>Regarding changing the host name, I tried that, and from my experiments, that didn't seem to work properly. But that might have been related again to the hostname temporarily not being resolvable. </div><blockquote type="cite"><div></div><div>The behaviour you describe is correct. The `cluster_nodes' list is only</div><div>effective on “virgin” nodes, that is nodes that are started for the first time.</div><div>If the node can’t connect to any other node it won’t do anything. 
The rationale</div><div>behind that comes from boring technicalities related to how mnesia (the database</div><div>that backs RabbitMQ clustering) works, and most importantly the fact that</div><div>clustered nodes shouldn’t experience netsplits. So if you say “Should a new</div><div>node be partitioned from the others only temporarily...”, maybe clustering is</div><div>not the right solution.</div><div><br></div></blockquote><div>The partitioning was mostly conjecture. It might happen, and it's still fixable if it does. The whole mechanism just strikes me as less than ideal when it comes to automating a cluster setup. </div><blockquote type="cite"><div></div><div>Again, I don’t see how DHCP has anything to do with this.</div><div><br></div></blockquote><div>Same as above :)</div><blockquote type="cite"><div>I don’t know anything about EC2 or Chef, and I don’t understand what you are</div><div>doing with DHCP, but I’m quite sure you’re mixing up two separate issues: one</div><div>is related to FQDNs and how nodes resolve other node names, and the other is</div><div>related to how the automatic clustering configuration works.</div><div><br></div></blockquote><div>They might seem different on the surface, but given how the RabbitMQ package installer works, they both converge on the issue of automating and customizing a cluster setup. </div><blockquote type="cite"><div></div><div>For what concerns the former, RabbitMQ nodes will use unqualified names (unless</div><div>you are tweaking how RabbitMQ nodes are started up), so you’ll have to play by</div><div>that. Obviously you’ll have to make the unqualified names resolve correctly on</div><div>each node, e.g. by adding entries in the hosts file.</div><div><br></div><div>For what concerns the latter, if you are confident that netsplits won’t occur,</div><div>then the semantics of the `cluster_nodes' configuration should be fine. 
If that</div><div>is not the case you might need to hack up some other mechanism but I would</div><div>advise against that because</div><div><br></div><div>1. It’s easy to get wrong, clustering is quite delicate and there are many</div><div>subtleties.</div></blockquote><div>Are all of them documented? Looking at the documentation, it seems to be mostly straightforward, except for the netsplits. If there are that many subtleties, I sure wish they'd be mentioned properly in the documentation.</div><blockquote type="cite"><div>2. If you are on a network that is affected by netsplits, you should avoid</div><div>clustering anyway - see &lt;<a href="http://www.rabbitmq.com/clustering.html">http://www.rabbitmq.com/clustering.html</a>&gt;.</div><div><br></div></blockquote><div>The cluster is going to run with all nodes in a single availability zone. Netsplits are unavoidable in any network but are less likely to happen in a setup like that.</div><div><br></div><div>Unfortunately, my biggest question remains unanswered: how do folks go about automating a cluster setup, with or without the issues described in my original email?</div><div><br></div><div>Any input on that particular topic would be much appreciated.</div><div><br></div><div>Thanks!</div><div><br></div><div>Cheers, Mathias</div>
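<div><br></div><div>For illustration only, here is a minimal sketch of how the manual join sequence discussed in this thread (stop_app, reset, join_cluster, start_app) might be scripted so that a periodic run, such as a Chef run, only acts when the node is not yet clustered. It is an assumption-laden sketch, not taken from the RabbitMQ documentation or the Chef cookbook: the function names and the node name rabbit@rabbitmq1 are hypothetical, and the check against cluster_status is a naive substring match that would need adapting to the actual output of the installed version.</div>

```python
import subprocess


def join_commands(seed_node):
    """Return the rabbitmqctl calls, in order, for joining seed_node.

    The reset step follows the sequence described in this thread;
    whether it is needed on a given version is an assumption.
    """
    return [
        ["rabbitmqctl", "stop_app"],
        ["rabbitmqctl", "reset"],
        ["rabbitmqctl", "join_cluster", seed_node],
        ["rabbitmqctl", "start_app"],
    ]


def already_clustered(status_output, seed_node):
    """Naive check (assumption): the seed node's name only appears in the
    cluster_status output once this node is clustered with it."""
    return seed_node in status_output


def ensure_joined(seed_node, run=subprocess.check_output):
    """Join seed_node unless cluster_status already mentions it.

    Returns True if the join sequence was run, False if nothing was done.
    The run parameter is injectable so the logic can be tried without
    a RabbitMQ installation.
    """
    status = run(["rabbitmqctl", "cluster_status"])
    if isinstance(status, bytes):
        status = status.decode()
    if already_clustered(status, seed_node):
        return False
    for command in join_commands(seed_node):
        run(command)
    return True
```

<div>On a real node this would be called as ensure_joined("rabbit@rabbitmq1"), with the short node name resolvable and the Erlang cookie already in place; a failing rabbitmqctl call raises an exception, so the run stops instead of continuing with a half-joined node.</div>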
</div>
<div><div><br></div></div>
<p style="color: #A0A0A8;">On Monday, 24. December 2012 at 12:31, Francesco Mazzoli wrote:</p>
<blockquote type="cite" style="border-left-style:solid;border-width:1px;margin-left:0px;padding-left:10px;">
<span><div><div><div><br></div><div><br></div><div>Hi Mathias,</div><div><br></div><div>At Thu, 20 Dec 2012 15:48:05 +0100,</div><div>Mathias Meyer wrote:</div><blockquote type="cite"><div><div>I spent some quality time automating a cluster setup of 3.0.1 on EC2 over the</div><div>last couple of days, and came across some things that felt a bit odd to</div><div>me. Maybe I did it all wrong, happy to be proven that I did. I apologize if</div><div>this turns out a bit long, there are a couple of questions that boil down to a</div><div>similar issue. Would be curious to hear how other folks have solved this</div><div>issue, and mostly focussing on RabbitMQ 3.0, as, from what I've read,</div><div>clustering behaviour has changed with that release.</div><div><br></div><div>First a wee bit about the setup, nothing special really, but required to</div><div>clarify some of the questions below: every node has a custom CNAME record</div><div><a href="http://rabbitmq1.domain.com">rabbitmq1.domain.com</a>, <a href="http://rabbitmq2.domain.com">rabbitmq2.domain.com</a>, each pointing to the EC2 host</div><div>name. I do this because I prefer full hostnames over EC2 hostnames because it</div><div>adds clarity to the whole setup, at least for me :)</div><div><br></div><div>The first issue with this setup comes up when you want to change the hostname</div><div>for the RabbitMQ node. Changing it post-installation is a bit of a hassle</div><div>because the package already starts up the service. Changing the nodename to</div><div>the FQDN and trying to restart the service after that leads to errors because</div><div>the service can't be stopped anymore as the nodenames are now different.</div><div><br></div><div>I solved this on EC2 by adding <a href="http://domain.com">domain.com</a> to the DHCP configuration and</div><div>restarting the network before installing the RabbitMQ package. It's not a</div><div>great solution, but acceptable. 
In the end it boils down to the package</div><div>starting the service immediately on installation, more on that below.</div></div></blockquote><div><br></div><div>Uhm, two things:</div><div><br></div><div> * RabbitMQ does not like FQDNs - I think you can probably make it work by</div><div> tweaking the startup scripts, but I am not sure. What we recommend is not</div><div> to use FQDNs.</div><div> * If you change the hostname and restart RabbitMQ, things should go smoothly.</div><div><br></div><div>That said, I’m not sure what DHCP has to do with this. I suppose that what you</div><div>have done is to make possible to resolve the short names of the other nodes,</div><div>since things are working.</div><div><br></div><blockquote type="cite"><div><div>The next issue is related to clustering. When RabbitMQ starts up, and there's</div><div>a cluster_nodes section in the config, it seems to try and reach the nodes</div><div>only once. This behaviour I'm not sure of, hence the question. I noticed that</div><div>when a node can't look up any of the nodes in the cluster_nodes config, it</div><div>won't try again at a later point in time, e.g. on restart. Therefore that node</div><div>will never automatically join the cluster unless rabbitmqctl join_cluster is</div><div>called.</div><div><br></div><div>The DHCP configuration helped solve this issue as well, but I'm still</div><div>wondering if a) my observation is correct and b) if this is desired</div><div>behaviour. Should a new node be partitioned from the others only temporarily,</div><div>when it joins, it requires manual intervention to force it to join the</div><div>cluster. 
This somewhat seems to conform to what the documentation says:</div><div><a href="http://www.rabbitmq.com/clustering.html#auto-config">http://www.rabbitmq.com/clustering.html#auto-config</a>, but I'm not entirely</div><div>clear on whether a node just gives up trying to form a cluster once it</div><div>couldn't reach any of the nodes in the cluster_nodes list.</div></div></blockquote><div><br></div><div>The behaviour you describe is correct. The `cluster_nodes' list is only</div><div>effective on “virgin” nodes, that is nodes that are started for the first time.</div><div>If the node can’t connect to any other node it won’t do anything. The rationale</div><div>behind that comes from boring technicalities related to how mnesia (the database</div><div>that backs RabbitMQ clustering) works, and most importantly the fact that</div><div>clustered nodes shouldn’t experience netsplits. So if you say “Should a new</div><div>node be partitioned from the others only temporarily...”, maybe clustering is</div><div>not the right solution.</div><div><br></div><div>Again, I don’t see how DHCP has anything to do with this.</div><div><br></div><blockquote type="cite"><div><div>So the question boils down to whether the automatic configuration is the way</div><div>to go or if it makes more sense to automate commands (using Chef, btw) around</div><div>the join_cluster and cluster_status commands.</div><div><br></div><div>On top of that, to have a fresh node join the cluster, it needs to be stopped</div><div>again (stop_app) and reset, which somewhat boils down to the service being</div><div>started on package installation again. This behaviour seems to also have</div><div>changed from 2.x where just updating the config and making sure all nodes have</div><div>the same Erlang cookie is correct, right?</div><div><br></div><div>So the biggest question is: how do folks work around the fact that the</div><div>RabbitMQ node is started up on package installation? 
There's the option to use</div><div>policy-rc.d systems, but I'm not quite sure how feasible this is to</div><div>automate. Given that Chef runs every 30 minutes or so, the consequence would</div><div>be to install a policy on every run or to check on every run whether or not</div><div>the desired package is already installed. Currently I'm stopping the service</div><div>after stop_app and reset, before installing the new Erlang cookie. I'm just</div><div>not sure, it feels a bit weird to automate to me. I'd love for some input or</div><div>suggestions on this.</div><div><br></div><div>My current setup is working, and I can add nodes that automatically join the</div><div>cluster, so it's okay. Just want to make sure it's the right approach, or if</div><div>there are any other experiences on this. From what I understand there are</div><div>differences to 2.x cluster setup, where updating the config and changing the</div><div>Erlang cookie apparently were all that's needed, but that's from my reading of</div><div>the documentation and existing Chef cookbooks</div><div>(<a href="https://github.com/opscode-cookbooks/rabbitmq">https://github.com/opscode-cookbooks/rabbitmq</a>). Overall I feel like there's a</div><div>bit of a lack of documentation on how setting up a cluster can or should be</div><div>automated. Happy to help improve that situation, but I'd like to be sure whether</div><div>the choices described above are sane or complete bollocks.</div><div><br></div><div>Again, apologies for the long email. I hope the setup, issues and questions</div><div>are somewhat clear. 
Please let me know if more input is required, happy to dive</div><div>into more detail.</div><div><br></div><div>Thank you for your time, for RabbitMQ clustering, and for any feedback you</div><div>might have :)</div></div></blockquote><div><br></div><div>I don’t know anything about EC2 or Chef, and I don’t understand what you are</div><div>doing with DHCP, but I’m quite sure you’re mixing up two separate issues: one</div><div>is related to FQDNs and how nodes resolve other node names, and the other is</div><div>related to how the automatic clustering configuration works.</div><div><br></div><div>For what concerns the former, RabbitMQ nodes will use unqualified names (unless</div><div>you are tweaking how RabbitMQ nodes are started up), so you’ll have to play by</div><div>that. Obviously you’ll have to make the unqualified names resolve correctly on</div><div>each node, e.g. by adding entries in the hosts file.</div><div><br></div><div>For what concerns the latter, if you are confident that netsplits won’t occur,</div><div>then the semantics of the `cluster_nodes' configuration should be fine. If that</div><div>is not the case you might need to hack up some other mechanism but I would</div><div>advise against that because</div><div><br></div><div> 1. It’s easy to get wrong, clustering is quite delicate and there are many</div><div> subtleties.</div><div> 2. If you are on a network that is affected by netsplits, you should avoid</div><div> clustering anyway - see &lt;<a href="http://www.rabbitmq.com/clustering.html">http://www.rabbitmq.com/clustering.html</a>&gt;.</div><div><br></div><div>And yes, things changed from 2.x on this front as part of efforts to make</div><div>clustering more solid.</div><div><br></div><div>I hope this helps.</div><div><br></div><div>Francesco</div></div></div></span>
</blockquote>
<div>
<br>
</div>