[rabbitmq-discuss] rabbitmqctl stall/hang when leaving a cluster

Thu Feb 23 21:00:27 GMT 2012

| I *think* the stats database is a red herring.

Perhaps, but it's the only correlation that I've seen. That is, I've never
seen it happen on a node that didn't have the stats database before it shut
down.

A little more background context: I'm writing "rolling restart" logic. For
each node in the cluster, in sequence, I stop the node, perform update
logic (currently nothing), then restart the node.
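
A minimal sketch of that loop, to make the sequence concrete. The function
name rolling_restart and the three hooks stop_node, update_node and
start_node are hypothetical names for this illustration; in my setup each
hook would wrap a remote Capistrano invocation. The loop stops at the first
failure so a broken node doesn't take the rest of the cluster down with it.

```shell
#!/bin/sh
# Sketch only. rolling_restart and the hooks stop_node, update_node and
# start_node are hypothetical names; the caller must define the hooks.

# rolling_restart NODE...
# Restart each node in sequence, aborting on the first failing step.
rolling_restart() {
    for node in "$@"; do
        echo "rolling restart: $node"
        stop_node "$node"   || { echo "stop failed on $node" >&2; return 1; }
        update_node "$node" || { echo "update failed on $node" >&2; return 1; }
        start_node "$node"  || { echo "start failed on $node" >&2; return 1; }
    done
}
```

With the hooks defined, it would be called as "rolling_restart play1 play2".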

| You say this happens when restarting?

Yes. Occasionally the node will restart OK, but more often than not it
hangs on "rabbitmqctl wait".

I modified my script to run rabbitmq-server as a background task. It's also
worth noting that these scripts are invoked remotely via Capistrano, so
until I prefaced the command with nohup, the server would start and then
immediately exit. The invocation line now looks like this:

nohup rabbitmq-server &

The nohup.out on the failing node ends with:

+---+   +---+
|   |   |   |
|   |   |   |
|   |   |   |
|   +---+   +-------+
|                   |
| RabbitMQ  +---+   |
|           |   |   |
|   v2.7.1  +---+   |
|                   |
+-------------------+
AMQP 0-9-1 / 0-9 / 0-8
Copyright (C) 2007-2011 VMware, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

node           : rabbit@play2
app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/../ebin/rabbit.app
home dir       : /home/mpietrek
config file(s) : /home/mpietrek/work/var/run/rabbitmq.config
cookie hash    : pS5H9kY3Wra/XdLEKT5hgQ==
log            : /home/mpietrek/work/logs/play2.mpietrek.internal.illumita.com/rabbit@play2.log
sasl log       : /home/mpietrek/work/logs/play2.mpietrek.internal.illumita.com/rabbit@play2-sasl.log
database dir   : /home/mpietrek/work/var/lib/rabbit@play2
erlang version : 5.7.4

-- rabbit boot start
starting file handle cache server
...done
starting worker pool
...done
starting database                                                     ...
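
For reference, here is a rough sketch of the start step following Simon's
two suggestions quoted below: set RABBITMQ_PID_FILE, background the server
script, and keep its stdout. The function name start_rabbit_node, the
RABBITMQ_SERVER and RABBITMQCTL override variables, and the pid and log
paths are placeholders of mine, not anything RabbitMQ defines. Removing the
old pid file first is my own precaution against reading a stale pid.

```shell
#!/bin/sh
# Sketch only. start_rabbit_node and the two override variables below are
# hypothetical names; they default to the real commands on PATH.
RABBITMQ_SERVER="${RABBITMQ_SERVER:-rabbitmq-server}"
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# start_rabbit_node PIDFILE LOGFILE
start_rabbit_node() {
    pidfile="$1"
    logfile="$2"

    # Remove any pid file left by a previous run (assumed precaution).
    rm -f "$pidfile"

    # The server script writes its own pid to RABBITMQ_PID_FILE, so the
    # recorded pid belongs to this start. Boot output goes to LOGFILE.
    RABBITMQ_PID_FILE="$pidfile" nohup "$RABBITMQ_SERVER" >"$logfile" 2>&1 &

    # "rabbitmqctl wait" returns once the node is up, or fails if the
    # process dies. On failure, show how far the boot sequence got.
    if ! "$RABBITMQCTL" wait "$pidfile"; then
        echo "node did not start; last boot output:" >&2
        tail -n 20 "$logfile" >&2
        return 1
    fi
}
```

Called as, for example, "start_rabbit_node /tmp/rabbit.pid /tmp/rabbit.out"
(both paths made up). Note that this does not add a timeout, so if the wait
hangs as it does now, the boot output is still in the log file for
inspection from another shell.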

On Thu, Feb 23, 2012 at 3:52 AM, Simon MacMullen <simon at rabbitmq.com> wrote:

> I *think* the stats database is a red herring. You say this happens when
> restarting?
>
>
> On 23/02/12 00:30, Matt Pietrek wrote:
>
>> Let me add some additional information, and re-summarize what I'm seeing.
>>
>> In our startup script for RabbitMQ we do the following:
>>
>> rabbitmq-server -detached
>> rabbitmqctl status
>> <Extract the PID from rabbitmqctl status, write to our PIDFILE>
>>
>
> There's a potential race here if an old server is running (maybe about to
> shut down?). rabbitmqctl status could pick up the old pid.
>
>> rabbitmqctl wait PIDFILE
>>
>
> However, rabbitmqctl wait should then detect that the pid has died and
> fail. Unless the pid gets reused by the OS but that is presumably unlikely.
>
> But rabbitmqctl wait will wait indefinitely as long as the pid is alive
> and not a fully functional rabbit node. So I'd check two things:
>
> 1) You should fix that race, it can be done safely:
>
> Do not use rabbitmq-server -detached and rabbitmqctl status to get the
> pid. Instead set RABBITMQ_PID_FILE and background the rabbitmq-server
> script. You will then *definitely* get the right pid since the script
> writes its own pid then execs - no race possible.
>
> 2) Capture the stdout of rabbitmq-server when you start it - if
> rabbitmqctl wait still hangs, see how far it's got / what it's doing.
>
>
> Cheers, Simon
>
> --
> Simon MacMullen
> RabbitMQ, VMware
>