<div dir="ltr">So, a final email on this. I ended up having to kill all the processes on all nodes in the cluster, then start them back up in order to recover. At that point, the node that wouldn't rejoin the cluster came online, started syncing messages, and responded fine. I'm guessing I had a deadlock someplace, though I'm not sure where it would be. I'll keep an eye on this and see what else I can discover. *SIGH* I really need to learn to debug and work with Erlang better.<br>
Thanks all,<br>Jason<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 10, 2014 at 4:22 PM, Jason McIntosh <span dir="ltr">&lt;<a href="mailto:mcintoshj@gmail.com" target="_blank">mcintoshj@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>So now the fun part. I decided to try to rebuild the middle node (I have boxes 10, 11 and 12). However, I can't get the middle node to reconnect to the cluster. Removing its mnesia directory allowed it to start, but it can't rejoin the cluster. So I tried removing the node from the cluster, e.g.:<br>
<br>rabbitmqctl -n cluster@rabbitmqm10 forget_cluster_node cluster@rabbitmqm11<br><br></div>But the above never responds; it just sits there hanging. <br><br>Running rabbitmqctl -n cluster@rabbitmqm11 status from the other nodes works fine. I'm about at a loss as to how the heck to repair things. I can't remove the node from the cluster, I can't start it with the mnesia directory in its current state, and removing the mnesia directory and trying to add it back in fails with "....done (already_member).". Running rabbitmqctl update_cluster_nodes cluster@rabbitmqm10 also just sits there, doing nothing and not responding.<br>
<div><br><br></div><div> I'm starting to really worry I'm going to have to completely rebuild my cluster...<span class="HOEnZb"><font color="#888888"><br>Jason<br><br></font></span></div></div><div class="HOEnZb">
<div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 10, 2014 at 2:55 PM, Jason McIntosh <span dir="ltr">&lt;<a href="mailto:mcintoshj@gmail.com" target="_blank">mcintoshj@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Not sure what's going on here. I just upgraded my cluster from 3.2.3 to 3.2.4 (including a restart of the machine). On startup, two of my initial nodes started fine, but when the third node in the cluster started, "/etc/init.d/rabbitmq-server start" just sat at "Starting rabbitmq-server: " without ever finishing. Running rabbitmqctl status shows:<br>
Status of node cluster@rabbitmqm11p ...<br>[{pid,62505},<br> {running_applications,[{os_mon,"CPO CXC 138 46","2.2.14"},<br> {inets,"INETS CXC 138 49","5.9.8"},<br>
{mnesia,"MNESIA CXC 138 12","4.11"},<br> {amqp_client,"RabbitMQ AMQP Client","3.2.4"},<br> {xmerl,"XML parser","1.3.6"},<br>
{eldap,"Ldap api","1.0.2"},<br> {sasl,"SASL CXC 138 11","2.3.4"},<br> {stdlib,"ERTS CXC 138 10","1.19.4"},<br>
{kernel,"ERTS CXC 138 10","2.16.4"}]},<br> {os,{unix,linux}},<br> {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:24:24] [async-threads:30] [hipe] [kernel-poll:true]\n"},<br>
{memory,[{total,48504352},<br> {connection_procs,2808},<br> {queue_procs,0},<br> {plugins,0},<br> {other_proc,16290632},<br> {mnesia,1783536},<br> {mgmt_db,0},<br> {msg_index,0},<br>
{other_ets,1120896},<br> {binary,725448},<br> {code,19691642},<br> {atom,703377},<br> {other_system,8186013}]},<br> {file_descriptors,[{total_limit,12188},<br> {total_used,0},<br>
{sockets_limit,10967},<br> {sockets_used,0}]},<br> {processes,[{limit,1048576},{used,117}]},<br> {run_queue,0},<br> {uptime,83}]<br>...done.<br><br><br></div>In the web management interface, I see this:<br>
Node statistics not available<br><h2>Memory details</h2>
<div>
<table>
<tbody><tr>
<th>Connections</th>
<td>2.7kB</td>
</tr>
<tr>
<th>Queues</th>
<td>0B</td>
</tr>
<tr>
<th>Plugins</th>
<td>0B</td>
</tr>
<tr>
<th>Other process memory</th>
<td>16MB</td>
</tr>
</tbody></table>
<table>
<tbody><tr>
<th>Mnesia</th>
<td>1.7MB</td>
</tr>
<tr>
<th>Message store index</th>
<td>0B</td>
</tr>
<tr>
<th>Management database</th>
<td>0B</td>
</tr>
<tr>
<th>Other ETS tables</th>
<td>1.1MB</td>
</tr>
</tbody></table>
<table>
<tbody><tr>
<th>Binaries</th>
<td>708kB</td>
</tr>
<tr>
<th>Code</th>
<td>19MB</td>
</tr>
<tr>
<th>Atoms</th>
<td>687kB</td>
</tr>
<tr>
<th>Other system</th>
<td>7.8MB</td>
</tr>
</tbody></table>
</div><br><br></div>So rabbit appears to have partially started, but certain things have not started (e.g. plugins). The plugins list is:<br>[e] amqp_client 3.2.4<br>[ ] cowboy 0.5.0-rmq3.2.4-git4b93c2d<br>
[ ] eldap 3.2.4-gite309de4<br>[e] mochiweb 2.7.0-rmq3.2.4-git680dba8<br>[ ] rabbitmq_amqp1_0 3.2.4<br>[E] rabbitmq_auth_backend_ldap 3.2.4<br>[ ] rabbitmq_auth_mechanism_ssl 3.2.4<br>
[E] rabbitmq_consistent_hash_exchange 3.2.4<br>[E] rabbitmq_federation 3.2.4<br>[E] rabbitmq_federation_management 3.2.4<br>[ ] rabbitmq_jsonrpc 3.2.4<br>[ ] rabbitmq_jsonrpc_channel 3.2.4<br>
[ ] rabbitmq_jsonrpc_channel_examples 3.2.4<br>[E] rabbitmq_management 3.2.4<br>[E] rabbitmq_management_agent 3.2.4<br>[E] rabbitmq_management_visualiser 3.2.4<br>[ ] rabbitmq_mqtt 3.2.4<br>
[E] rabbitmq_shovel 3.2.4<br>[E] rabbitmq_shovel_management 3.2.4<br>[ ] rabbitmq_stomp 3.2.4<br>[ ] rabbitmq_tracing 3.2.4<br>[e] rabbitmq_web_dispatch 3.2.4<br>
[ ] rabbitmq_web_stomp 3.2.4<br>[ ] rabbitmq_web_stomp_examples 3.2.4<br>[ ] rfc4627_jsonrpc 3.2.4-git5e67120<br>[ ] sockjs 0.3.4-rmq3.2.4-git3132eb9<br>[e] webmachine 1.10.3-rmq3.2.4-gite9359c7<br>
<br><br>Any suggestions on next steps for debugging this? Or what I can do to get this back up and in a "healthy" state?<br><br>Thanks!<span><font color="#888888"><br>Jason<br><div><br><br><div><br clear="all">
<div><br>-- <br><div dir="ltr">
Jason McIntosh<br><a href="https://github.com/jasonmcintosh/" target="_blank">https://github.com/jasonmcintosh/</a><br><a href="tel:573-424-7612" value="+15734247612" target="_blank">573-424-7612</a></div>
</div></div></div></font></span></div>
</blockquote></div>
</div>
</div></div></blockquote></div>
</div>