[rabbitmq-discuss] timeout_waiting_for_tables on node that has not changed node name
Simon MacMullen
simon at rabbitmq.com
Tue Dec 13 11:38:50 GMT 2011
Just to be clear, are you saying that resolv.conf lacked a "search" line
and thus DNS queries for short hostnames did not work? Or something else?
That does look like something we could detect / warn about. Hmm.
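
A minimal sketch of what such a check might look like, written in Python for illustration rather than in RabbitMQ's own Erlang. It assumes short node names of the form rabbit@host1; the function names (parse_search_domains, host_of, unresolvable_hosts) are hypothetical and not part of RabbitMQ or of any existing tool.

```python
import socket


def parse_search_domains(resolv_conf_text):
    """Return the domains from the last 'search' or 'domain' line, or []."""
    domains = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        # 'search' and 'domain' override each other; the last one wins.
        if parts[0] in ("search", "domain"):
            domains = parts[1:]
    return domains


def host_of(node_name):
    """Extract the host part of a node name such as 'rabbit@host1'."""
    return node_name.rpartition("@")[2]


def unresolvable_hosts(hosts, resolver=socket.gethostbyname):
    """Return the hosts that the resolver fails to look up."""
    failed = []
    for host in hosts:
        try:
            resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

On a cluster member, one could feed the contents of /etc/resolv.conf to parse_search_domains and the host parts of the other nodes' names to unresolvable_hosts, then log a warning if the first result is empty or the second is not. The resolver parameter exists only so the lookup can be swapped out in tests.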
Cheers, Simon
On 09/12/11 22:33, Elias Levy wrote:
> We figured it out on our own. The cluster is using short names for node
> names. For some reason the EC2 DHCP client failed to set the domain
> entry in resolv.conf, thus the restarted node's mnesia could not
> communicate with other nodes.
>
> Yet the errors and logs made no mention of the fact that the node's
> hostname could not be resolved to an IP address. Had such an error been
> printed, the problem could have been diagnosed and fixed in a couple of
> minutes.
>
> Can someone please open a trouble ticket to get some sort of error
> logged for these cases?
>
> I have to imagine many of the issues reported
> as timeout_waiting_for_tables errors are of similar origin.
>
>
>
> On Fri, Dec 9, 2011 at 12:33 PM, Elias Levy
> <fearsome.lucidity at gmail.com> wrote:
>
> Last night we had to reboot a RabbitMQ node in a 3 node cluster
> within EC2. The node failed to restart with the
> dreaded timeout_waiting_for_tables error.
>
> Looking at past discussions on that topic, it is clear that the most
> common reason for it is a node name change, either because the node
> name contains the IP address, the hostname changed, or a new node is
> being provisioned on an image with an old mnesia DB with some other
> nodename.
>
> None of those appears to apply in our current situation. The node
> name does not include the IP address and the node name did not
> change, as can be seen in the start up logs. Just to be sure we set
> the node name in the /etc/rabbitmq/rabbitmq-env.conf file and
> attempted to restart, again without success.
>
> I enabled mnesia debugging at the trace level and it does not
> provide any useful information as to what is causing the timeout.
> The cluster has developed a backlog of persistent messages in two
> of the queues (about 70K in total), but from looking at what tables
> the system complains about it does not appear those are the tables
> it's trying to sync. All the other metadata (users, exchanges,
> bindings, queues) is of very small size, so 30 seconds should be
> sufficient time.
>
> While we could wipe the mnesia state from the node, we'd like to
> find out why this happens and whether it can be repaired, for future
> reference.
>
--
Simon MacMullen
RabbitMQ, VMware