We figured it out on our own. The cluster uses short names for its node names. For some reason the EC2 DHCP client failed to set the domain entry in resolv.conf, so the restarted node's mnesia could not communicate with the other nodes.<div>
<br></div><div>Yet the errors and logs made no mention of the fact that the node names could not be resolved. Had such an error been printed, the problem could have been diagnosed and fixed in a couple of minutes.</div>
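<div><br></div><div>For reference, this is a sketch of the entry the DHCP client should have written to resolv.conf; the nameserver address and search domain below are hypothetical examples (on EC2 the domain is region-specific):</div><div><br></div>

```
# /etc/resolv.conf -- hypothetical example values; EC2's DHCP client
# normally writes the region-specific domain (e.g. ec2.internal)
nameserver 172.16.0.23
search ec2.internal
```

<div><br></div><div>With the search line present, a short hostname such as ip-10-0-0-5 can be resolved by appending the domain, which is what short-name node addressing relies on.</div>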
<div><br></div><div>Can someone please open a trouble ticket to get some sort of error logged for this cases?</div><div><br></div><div>I have to imagine many of the issues reported as timeout_waiting_for_tables errors are of similar origin.</div>
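<div><br></div><div>As a quick sanity check before digging into mnesia, here is a hedged sketch of how one might confirm this failure mode: verify that the machine's short hostname (the part Erlang uses after the @ in a short-name node like rabbit@ip-10-0-0-5) actually resolves.</div><div><br></div>

```shell
#!/bin/sh
# Hedged diagnostic sketch: check whether the short hostname, as used
# by Erlang short-name distribution, resolves on this machine.
short=$(hostname -s)
if getent hosts "$short" >/dev/null 2>&1; then
  echo "ok: $short resolves"
else
  echo "FAIL: $short does not resolve; check the search/domain line in /etc/resolv.conf"
fi
```

<div><br></div><div>If this prints FAIL after a reboot, the timeout_waiting_for_tables is likely a name-resolution problem rather than a corrupt mnesia DB.</div>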
<div><br><div><br><br><div class="gmail_quote">On Fri, Dec 9, 2011 at 12:33 PM, Elias Levy <span dir="ltr"><<a href="mailto:fearsome.lucidity@gmail.com">fearsome.lucidity@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Last night we had to reboot a RabbitMQ node in a 3-node cluster within EC2. The node failed to restart with the dreaded timeout_waiting_for_tables error. <div><br></div><div>Looking at past discussions on that topic, it is clear that the most common reason for it is a node name change: either the node name contains the IP address, the hostname changed, or a new node was provisioned from an image carrying an old mnesia DB with some other node name.</div>
<div><br></div><div>None of those appears to apply in our current situation. The node name does not include the IP address, and the node name did not change, as can be seen in the startup logs. Just to be sure, we set the node name explicitly in /etc/rabbitmq/rabbitmq-env.conf and attempted to restart, again without success.</div>
<div><br></div><div>I enabled mnesia debugging at the trace level, but it does not provide any useful information as to what is causing the timeout. The cluster has developed a backlog of persistent messages in two of the queues (about 70K in total), but judging from which tables the system complains about, those do not appear to be the tables it is trying to sync. All the other metadata (users, exchanges, bindings, queues) is very small, so 30 seconds should be ample time.</div>
<div><br></div><div>While we could wipe the mnesia state from the node, we'd like to find out why this happens and whether it can be repaired, for future reference.</div><div><br></div></blockquote></div></div></div>