[rabbitmq-discuss] Issue with rabbit starting up every time
chris.madden at gmail.com
Fri Jul 1 13:17:30 BST 2011
On Friday, July 1, 2011 at 8:07 AM, Matthias Radestock wrote:
> On 30/06/11 20:47, Chris Madden wrote:
> > I have 2 nodes in a cluster, both are disc nodes. Occasionally,
> > following a reboot, rabbit will not start. [...]
> > Interestingly, it seems to correct itself if I continue to restart
> > rabbit. Sometimes it can take 15-20 attempts to get it to start
> > correctly.
> Are you restarting both nodes or just one? And when you are "restarting
> rabbit", are you just restarting the rabbitmq server process or
> rebooting the entire machine?
The problem usually manifests after one of the machines reboots. I have a babysitter that does two things: 1) makes sure rabbit is up and 2) makes sure the cluster is fully formed. If either test fails, it issues an '/etc/init.d/rabbitmq-server restart'.
When we get into this state, it is usually our watchdog that gets things back up and running.
Perhaps our watchdogs are running too frequently (once a minute). I'll back off their aggressiveness to see if that helps.
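A rough sketch of a less aggressive watchdog, in Python. This is only an illustration, not our actual script: the function names, the 60-second base interval and the 15-minute cap are assumptions, and the two health checks are passed in as callables because how they are implemented is not shown here. The idea is to double the wait after each consecutive failed check, so that a slow-booting node is not restarted again while it is still starting up.

```python
import subprocess
import time

# Restart command as quoted in this message.
RESTART_CMD = ["/etc/init.d/rabbitmq-server", "restart"]


def backoff_delay(failures, base=60, cap=900):
    """Seconds to wait before the next check.

    `failures` is the number of consecutive failed checks so far.
    `base` and `cap` are illustrative values, not tuned ones.
    """
    return min(cap, base * (2 ** failures))


def watchdog_step(is_up, is_clustered, restart, failures):
    """Run one check and return the updated consecutive-failure count.

    `is_up` and `is_clustered` are hypothetical callables returning True
    when rabbit is running and when the cluster is fully formed.
    """
    if is_up() and is_clustered():
        return 0  # healthy: reset the backoff
    restart()
    return failures + 1


def run(is_up, is_clustered):
    """Loop forever, waiting longer after each consecutive failure."""
    failures = 0
    while True:
        failures = watchdog_step(
            is_up, is_clustered,
            lambda: subprocess.call(RESTART_CMD),
            failures,
        )
        time.sleep(backoff_delay(failures))
```

With these values the watchdog still checks once a minute while everything is healthy, but waits 2, 4, 8 and then at most 15 minutes between restarts when the node keeps failing.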
Further, the machines involved here are virtual machines. I see the problem more when I have a single CPU than when I have multiple.
> > I'm suspicious of
> > http://hg.rabbitmq.com/rabbitmq-server/file/5f84b55205fd/src/rabbit_mnesia.erl#l610,
> > with a hard-coded timeout: a heavily loaded system (which this
> > definitely is at boot time) may take more than 30 seconds (assuming I'm
> > reading that correctly).
> We filed a bug back in 2009 to come up with something better than the 30
> second timeout. But until now we've had no evidence that it is actually
> causing problems. Yes, users have been reporting rabbit failing to start
> with a timeout_waiting_for_tables error, but in all cases I recall the
> underlying problem wasn't the timeout duration, i.e. increasing the
> timeout would simply have led to waiting for longer and then still failing.
> So another thing to try would be to increase the timeout in the code and
> see whether that changes the behaviour you are seeing or merely delays
> the failure.
I will try this, thanks!