[rabbitmq-discuss] Issue with rabbit starting up every time

Chris Madden chris.madden at gmail.com
Fri Jul 1 13:17:30 BST 2011


On Friday, July 1, 2011 at 8:07 AM, Matthias Radestock wrote:

> Chris,
> 
> On 30/06/11 20:47, Chris Madden wrote:
> > I have 2 nodes in a cluster, both are disc nodes. Occasionally,
> > following a reboot, rabbit will not start. [...]
> > Interestingly, it seems to correct itself if I continue to restart
> > rabbit. Sometimes it can take 15-20 attempts to get it to start
> > correctly.
> 
> Are you restarting both nodes or just one? And when you are "restarting 
> rabbit", are you just restarting the rabbitmq server process or 
> rebooting the entire machine?
> 
The problem usually manifests after one of the machines reboots. I have a babysitter that does two things: 1) makes sure rabbit is up and 2) makes sure the cluster is fully formed. If either test fails, it issues an '/etc/init.d/rabbitmq-server restart'.

When we get into this state, it is usually our watchdog that gets things back up and running.

Perhaps our watchdogs are running too frequently (once a minute). I'll make them less aggressive to see whether that helps.
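
For illustration, here is a minimal sketch in Python of what a less aggressive watchdog could look like. It is not our actual babysitter script. The two health checks are passed in as callables because how they are implemented is an assumption left open here, and the names next_delay, watchdog, and restart_rabbit, as well as the 60-second base and 600-second cap, are made up for the example. Only the restart command comes from the setup described above. The idea is to double the wait after each consecutive failed check, so that a node that is still booting is not restarted again once a minute.

```python
import subprocess
import time

# Restart command as used by the babysitter described above.
RESTART_CMD = ["/etc/init.d/rabbitmq-server", "restart"]


def next_delay(failures, base=60, cap=600):
    """Seconds to wait before the next check.

    The base interval doubles for every consecutive failure and is
    capped, so repeated failures slow the watchdog down instead of
    producing a restart every minute. base and cap are illustrative.
    """
    return min(cap, base * (2 ** failures))


def restart_rabbit():
    """Issue the init-script restart (needs the right privileges)."""
    subprocess.call(RESTART_CMD)


def watchdog(is_up, cluster_formed, restart=restart_rabbit,
             sleep=time.sleep, max_rounds=None):
    """Run both checks in a loop and restart rabbit if either fails.

    is_up and cluster_formed are hypothetical callables returning
    True or False. Returns the number of restarts issued. With
    max_rounds=None the loop runs forever.
    """
    failures = 0
    rounds = 0
    restarts = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        if is_up() and cluster_formed():
            failures = 0  # healthy again: return to the base interval
        else:
            restart()
            restarts += 1
            failures += 1
        sleep(next_delay(failures))
    return restarts
```

With these example values, a node that keeps failing is restarted, then left alone for 2 minutes, then 4, then 8, up to the 10-minute cap. Whether that gives a slow boot enough time on a single-CPU VM is something that would have to be tested.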

Further, the machines involved here are virtual machines. I see the problem more often when a VM has a single CPU than when it has several.

> > I'm suspicious of
> > http://hg.rabbitmq.com/rabbitmq-server/file/5f84b55205fd/src/rabbit_mnesia.erl#l610,
> > with a hard-coded timeout: a heavily loaded system (which this
> > definitely is at boot time) may take more than 30 seconds (assuming I'm
> > reading that correctly).
> 
> We filed a bug back in 2009 to come up with something better than the 30 
> second timeout. But until now we've had no evidence that it is actually 
> causing problems. Yes, users have been reporting rabbit failing to start 
> with a timeout_waiting_for_tables error, but in all cases I recall the 
> underlying problem wasn't the timeout duration, i.e. increasing the 
> timeout would simply have led to waiting for longer and then still failing.
> 
> So another thing to try would be to increase the timeout in the code and 
> see whether that changes the behaviour you are seeing or merely delays 
> the failure.

I will try this, thanks! 

