[rabbitmq-discuss] timeout_waiting_for_tables on node that has not changed node name

Fri Dec 9 22:33:47 GMT 2011

We figured it out on our own.  The cluster is using short names for node
names.  For some reason the EC2 DHCP client failed to set the domain entry
in resolv.conf, thus the restarted node's mnesia could not communicate with
other nodes.
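
For anyone who wants to check for the same condition, here is a minimal Python sketch that reports which search domains a resolv.conf file defines. The function name `search_domains` is my own, and it assumes the usual resolv.conf rules: `domain` and `search` keywords, `#` or `;` comment lines, and the last instance winning. An empty result means short host names can only resolve through /etc/hosts or some other source.

```python
def search_domains(resolv_conf_text):
    """Return the search domains defined in the text of a resolv.conf file.

    Hypothetical helper for illustration. Assumes the last 'domain' or
    'search' line wins, as described in the resolv.conf man page.
    """
    domains = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0][0] in "#;":
            continue
        if fields[0] in ("domain", "search"):
            domains = fields[1:]
    return domains


if __name__ == "__main__":
    with open("/etc/resolv.conf") as f:
        found = search_domains(f.read())
    if found:
        print("search domains:", " ".join(found))
    else:
        print("no domain/search entry; short host names may not resolve")
```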

Yet the errors and logs made no mention of the fact that host names
could not be resolved.  Had such an error been printed, the problem could
have been diagnosed and fixed in a couple of minutes.
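
As a stopgap until something like that is logged, a check along these lines can be run by hand before starting the node. It is only a sketch in Python: the helper names and the example node names are made up, and it assumes node names of the form name@host. It does not reproduce how Erlang itself resolves names; it only asks the system resolver.

```python
import socket


def host_of(node_name):
    """Return the host part of a node name such as 'rabbit@host1'."""
    _, sep, host = node_name.partition("@")
    if not sep or not host:
        raise ValueError("expected name@host, got %r" % node_name)
    return host


def unresolvable_nodes(node_names, resolve=socket.gethostbyname):
    """Return the node names whose host part the resolver cannot look up.

    'resolve' is injectable so the check can be exercised without DNS.
    """
    failed = []
    for name in node_names:
        try:
            resolve(host_of(name))
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical cluster members; substitute the real node names.
    cluster = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]
    for name in unresolvable_nodes(cluster):
        print("cannot resolve host for", name)
```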

Can someone please open a trouble ticket to get some sort of error logged
for such cases?

I have to imagine many of the issues reported as timeout_waiting_for_tables
errors are of similar origin.

On Fri, Dec 9, 2011 at 12:33 PM, Elias Levy <fearsome.lucidity at gmail.com> wrote:

> Last night we had to reboot a RabbitMQ node in a 3 node cluster within
> EC2.  The node failed to restart with the
> dreaded timeout_waiting_for_tables error.
>
> Looking at past discussions on that topic it is clear that the most common
> reason for it is a node name change, either because the node name contains
> the IP address, the hostname changed, or a new node is being provisioned on
> an image with an old mnesia DB with some other nodename.
>
> None of those appears to apply in our current situation.  The node name
> does not include the IP address and the node name did not change, as can be
> seen in the start up logs.  Just to be sure we set the node name in the
> /etc/rabbitmq/rabbitmq-env.conf file and attempted to restart, again
> without success.
>
> I enabled mnesia debugging at the trace level and it does not provide any
> useful information as to what is causing the timeout.  The cluster has
> developed a backlog of persistent messages in two of the queues (about 70K
> in total), but from looking at what tables the system complains about it
> does not appear those are the tables it's trying to sync.  All the other
> metadata (users, exchanges, bindings, queues) is of very small size, so 30
> seconds should be sufficient time.
>
> While we could wipe the mnesia state from the node, we'd like to find out
> why this happens and whether it can be repaired, for future reference.
>
>