[rabbitmq-discuss] timeout_waiting_for_tables on node that has not changed node name

Tue Dec 13 11:38:50 GMT 2011

Just to be clear, are you saying that resolv.conf lacked a "search" line 
and thus DNS queries for short hostnames did not work? Or something else?

That does look like something we could detect / warn about. Hmm.
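
A rough sketch of what such a check might look like, in Python purely for illustration (this is not RabbitMQ code, and the function names and messages below are hypothetical). It assumes the node name has the usual name@host form, looks for a "search" or "domain" line in resolv.conf-style text, and tries to resolve the host part:

```python
import socket


def parse_resolv_conf(text):
    """Collect the suffixes named on any "search" or "domain" lines."""
    suffixes = []
    for line in text.splitlines():
        # Simplification: treat anything after '#' or ';' as a comment.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("search", "domain"):
            suffixes.extend(fields[1:])
    return suffixes


def short_name_resolves(host):
    """Return True if the system resolver can turn host into an address."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return False


def diagnose(node_name, resolv_conf_text, resolver=short_name_resolves):
    """Return human-readable warnings for a node name (hypothetical helper)."""
    host = node_name.split("@", 1)[-1]
    problems = []
    # A short (dot-free) hostname generally needs a search/domain suffix
    # unless it is resolved some other way, e.g. via /etc/hosts.
    if "." not in host and not parse_resolv_conf(resolv_conf_text):
        problems.append("resolv.conf has no search or domain entry")
    if not resolver(host):
        problems.append("cannot resolve host %r" % host)
    return problems
```

On a real machine one might call it as diagnose("rabbit@node2", open("/etc/resolv.conf").read()), where "rabbit@node2" is a made-up node name, and log whatever warnings come back before waiting on the mnesia tables.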

Cheers, Simon

On 09/12/11 22:33, Elias Levy wrote:
> We figured it out on our own.  The cluster is using short names for node
> names.  For some reason the EC2 DHCP client failed to set the domain
> entry in resolv.conf, thus the restarted node's mnesia could not
> communicate with other nodes.
>
> Yet the errors and logs made no mention of the fact that the node's IP
> could not be resolved.  Had such an error been printed the problem could
> have been diagnosed and fixed in a couple of minutes.
>
> Can someone please open a trouble ticket to get some sort of error
> logged for these cases?
>
> I have to imagine many of the issues reported
> as timeout_waiting_for_tables errors are of similar origin.
>
>
>
> On Fri, Dec 9, 2011 at 12:33 PM, Elias Levy <fearsome.lucidity at gmail.com
> <mailto:fearsome.lucidity at gmail.com>> wrote:
>
>     Last night we had to reboot a RabbitMQ node in a 3 node cluster
>     within EC2.  The node failed to restart with the
>     dreaded timeout_waiting_for_tables error.
>
>     Looking at past discussions on that topic, it is clear that the most
>     common reason for it is a node name change, either because the node
>     name contains the IP address, the hostname changed, or a new node is
>     being provisioned on an image with an old mnesia DB with some other
>     nodename.
>
>     None of those appears to apply in our current situation.  The node
>     name does not include the IP address and the node name did not
>     change, as can be seen in the start up logs.  Just to be sure we set
>     the node name in the /etc/rabbitmq/rabbitmq-env.conf file and
>     attempted to restart, again without success.
>
>     I enabled mnesia debugging at the trace level and it does not
>     provide any useful information as to what is causing the timeout.
>     The cluster has developed a backlog of persistent messages in two
>     of the queues (about 70K in total), but from looking at what tables
>     the system complains about it does not appear those are the tables
>     it's trying to sync.  All the other metadata (users, exchanges,
>     bindings, queues) is of very small size, so 30 seconds should be
>     sufficient time.
>
>     While we could wipe the mnesia state from the node, we'd like to
>     find out why this happens and whether it can be repaired, for future
>     reference.

-- 
Simon MacMullen
RabbitMQ, VMware