<div dir="ltr">I also posted this to StackOverflow.<div><br></div><div><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
We run several production rabbit clusters (two nodes each), and transfer messages between them using federation. In development/testing, we only have one cluster due to hardware/resource limitations. In order to (poorly, but still better than nothing) test federation, we create federation links <em style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">between two virtual hosts on the same cluster</em>. This is sketchy, I know, but it worked for us when using RabbitMQ 2.8.5, and usually works for us on 3.2.4 (see the issue below). Someday we may succeed in making the case that we need a second development Rabbit cluster, but today is not that day.</p>
<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<strong style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">How I set up intra-cluster federation:</strong></p><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
Once the cluster is configured and all nodes report ready, I run a test that does the following. It always talks to one node in the cluster; it doesn't load-balance yet (once the test becomes stable, we will use a DNS load balancer across both cluster nodes).</p>
<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
First, it sets up its environment. It creates a test user and a pair of testing virtual hosts, grants the appropriate permissions, and ensures that it can access both vhosts via AMQP and the API as that user. Then, it sets up the federation links by creating an upstream on the first vhost (call it A) pointing to the second vhost on the same cluster (call it B). Then, it creates a policy which applies the upstream to an exchange (directly; we only use one upstream, so I don't create any upstream sets, though I could if need be). It then polls the federation link via the API until it reports a status of "running". I could also poll for the existence of the exchange on the upstream vhost, but I haven't seen a need to do so.</p>
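<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
For reference, the setup steps above look roughly like the following sketch. It is not the actual test code: the function names, the upstream and policy names, and the base URL are made up for illustration, and the request paths and body shapes are what I understand the management API to accept (the federation link listing assumes the federation management plugin is enabled). The functions only build the requests and poll through a caller-supplied fetch function, so no HTTP client is assumed.</p>

```python
import json
import time
from urllib.parse import quote


def upstream_request(api_base, vhost, name, upstream_uri):
    """Build (method, url, body) declaring a federation upstream on `vhost`."""
    url = (f"{api_base}/api/parameters/federation-upstream/"
           f"{quote(vhost, safe='')}/{quote(name, safe='')}")
    # Assumed body shape: the upstream definition lives under "value".
    body = json.dumps({"value": {"uri": upstream_uri}})
    return "PUT", url, body


def policy_request(api_base, vhost, policy_name, pattern, upstream_name):
    """Build (method, url, body) for a policy applying one upstream directly."""
    url = (f"{api_base}/api/policies/"
           f"{quote(vhost, safe='')}/{quote(policy_name, safe='')}")
    body = json.dumps({
        "pattern": pattern,
        "apply-to": "exchanges",
        # A single upstream, rather than "federation-upstream-set".
        "definition": {"federation-upstream": upstream_name},
    })
    return "PUT", url, body


def wait_until_running(fetch_links, timeout=30.0, interval=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until every federation link reports status "running".

    `fetch_links` is a caller-supplied function returning the decoded JSON
    list of links (e.g. from GET /api/federation-links/<vhost>).
    Returns True on success, False if `timeout` seconds elapse first.
    """
    deadline = clock() + timeout
    while True:
        links = fetch_links()
        if links and all(link.get("status") == "running" for link in links):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
In this sketch the upstream is declared on vhost A with a URI that points at vhost B on the same cluster (something like <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">amqp://localhost/B</code>), and the polling stops as soon as every link on A reports "running" or the timeout expires.</p>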
<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
Then, to make sure federation is actually usable, I publish AMQP messages into the federated exchange on B, and make sure they are received on A.</p><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
This test usually works. We did something similar on RabbitMQ 2.8.5, and it worked flawlessly (without the policies; federation upstreams etc. were configured in the configuration file instead).</p><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<strong style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">The problem:</strong></p><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
This test <em style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">usually</em> works. However, about 20% of the time (by my estimate) it breaks the cluster: between 1 and 5 minutes after running the test, the cluster stops being fully responsive. A full description of the symptoms follows:</p>
<ul style="margin:0px 0px 1em 30px;padding:0px;border:0px;font-size:14px;vertical-align:baseline;list-style-position:initial;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">When the cluster is in this state, I can open AMQP connections (authentication, tuning, etc. all complete successfully), but I get no RPC replies for any subsequent AMQP methods (declare*, consume, etc.).</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">The management API will not load the "Admin" section, doing nothing in GUI mode, and returning <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">[]</code> in JSON mode.</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Attempting to publish a message via the management UI results in the entire API becoming permanently inaccessible (pages only render the RabbitMQ logo and no other components, and JSON requests time out).</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Nothing is written to the logs (main file or SASL) from a few seconds before the outage starts. The last log entry is unpredictable; sometimes it is a record of the test user created for federation having its password changed, sometimes a record of the test vhost being automatically deleted after the test, and sometimes a record of an AMQP connection being accepted. No subsequent events are logged (not even successful AMQP connections, though, as detailed above, I can't do anything via RPC once connected).</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Both the Beam and rabbitmq-server processes on both nodes consume minimal resources while hung; running <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">strace</code> on the <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">rabbitmq-server</code>, <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">beam</code>, and Erlang helper <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">inet_gethost</code> processes just shows a lot of <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">select</code> system calls either blocking indefinitely or timing out and being immediately retried.</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Checking the open sockets for processes owned by the rabbitmq user shows that there are two (bidirectional, two sockets each) IPv4 links from the master node of the cluster to itself (presumably those are the federation links, or some other manager process). It also shows every API HTTP connection initiated since the start of the outage in <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">CLOSE_WAIT</code> status.</li>
</ul><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
Some interesting things happen when I try to shut down the cluster:</p><ul style="margin:0px 0px 1em 30px;padding:0px;border:0px;font-size:14px;vertical-align:baseline;list-style-position:initial;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Attempting to shut down either node of the cluster via the init scripts or <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">rabbitmqctl</code> hangs indefinitely. Nothing is written to <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">shutdown_err</code> unless I interrupt the hung shutdown, in which case a notification of the interrupt is logged.</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">After an attempted shutdown, the management API/UI becomes inaccessible, and any hung connections that were in <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">CLOSE_WAIT</code> status disappear.</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">All RabbitMQ processes that were visible before the shutdown attempt remain visible. AMQP operations are still impossible.</li>
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Killing the RabbitMQ processes (via <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">SIGTERM</code>) and then restarting the master node in the cluster resolves the issue. Rabbit starts back up, and all objects, federation links, etc. created before the outage are available and working.</li>
</ul><p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<strong style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Questions:</strong></p><ol style="margin:0px 0px 1em 30px;padding:0px;border:0px;font-size:14px;vertical-align:baseline;list-style-position:initial;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
<li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">Why is this happening?</li><li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">
Is this only happening because of the strange intra-cluster federation setup I have, or is there another cause?</li><li style="margin:0px;padding:0px;border:0px;vertical-align:baseline;background-color:transparent">How do I resolve it?</li>
</ol></div></div>
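<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
In case it helps anyone reproduce the <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">CLOSE_WAIT</code> observation described above, here is a minimal sketch of one way to count such sockets on Linux. It is an illustration, not the tool used for the observations in this post: the function name is made up, it only covers IPv4 (it expects the text of /proc/net/tcp), and the port 15672 in the usage note assumes the default management port.</p>

```python
CLOSE_WAIT = "08"  # Linux kernel TCP state code for CLOSE_WAIT


def count_close_wait(proc_net_tcp_text, local_port):
    """Count IPv4 sockets in CLOSE_WAIT whose local port is `local_port`.

    `proc_net_tcp_text` is the full text of /proc/net/tcp, where each row
    after the header has the local address as HEXIP:HEXPORT in the second
    column and the state code in the fourth column.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        local_addr, state = fields[1], fields[3]
        port = int(local_addr.rsplit(":", 1)[1], 16)
        if port == local_port and state == CLOSE_WAIT:
            count += 1
    return count
```

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:14px;vertical-align:baseline;clear:both;color:rgb(0,0,0);font-family:Arial,'Liberation Sans','DejaVu Sans',sans-serif;line-height:17.804800033569336px">
Usage would be along the lines of reading /proc/net/tcp on the affected node and calling <code style="margin:0px;padding:1px 5px;border:0px;vertical-align:baseline;background-color:rgb(238,238,238);font-family:Consolas,Menlo,Monaco,'Lucida Console','Liberation Mono','DejaVu Sans Mono','Bitstream Vera Sans Mono','Courier New',monospace,serif;white-space:pre-wrap">count_close_wait(text, 15672)</code>; a count that keeps growing with each API request would match the symptom described in the list of symptoms.</p>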