[rabbitmq-discuss] rabbitmqctl stall/hang when leaving a cluster

Wed Mar 14 17:06:21 GMT 2012

Following up on this, I've gone back and looked at all the logs I can think
of.

On the node that hangs (stuck on "starting database   ..."), here's the
console output:

+---+   +---+
|   |   |   |
|   |   |   |
|   |   |   |
|   +---+   +-------+
|                   |
| RabbitMQ  +---+   |
|           |   |   |
|   v2.7.1  +---+   |
|                   |
+-------------------+
AMQP 0-9-1 / 0-9 / 0-8
Copyright (C) 2007-2011 VMware, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

node           : rabbit@play
app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.7.1/sbin/../ebin/rabbit.app
home dir       : /home/mpietrek
config file(s) : /home/mpietrek/work/var/run/rabbitmq.config
cookie hash    : pR5H9kY3Wra/XdLELT5hgQ==
log            : /home/mpietrek/work/logs/play.mpietrek.internal.illumita.com/rabbit@play.log
sasl log       : /home/mpietrek/work/logs/play.mpietrek.internal.illumita.com/rabbit@play-sasl.log
database dir   : /home/mpietrek/work/var/lib/rabbit@play
erlang version : 5.7.4

-- rabbit boot start
starting file handle cache server
...done
starting worker pool
...done
starting database

And this is the last output in the log file:

=INFO REPORT==== 14-Mar-2012::09:50:33 ===
Limiting to approx 924 file handles (829 sockets)

On the node that's the master (labeled "disc stats" in the Overview tab),
there's nothing in the log about the new node joining.

Is there anyplace else I should be looking for clues to assist you? This
issue is a pretty big spanner in the works for our rolling upgrade scenario.
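
For reference, the rolling restart loop from the Feb 24 message quoted below, including the 10 second delay that seemed to help, can be sketched in Python roughly as follows. The function name rolling_restart and the stop/start callables are hypothetical stand-ins for the internal commands the real script relies on; this is an illustration under those assumptions, not the actual script.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    brokers: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart each broker in turn, pausing after each one.

    `stop` and `start` are hypothetical callables supplied by the caller
    (for example, wrappers around whatever commands stop and start a node).
    The pause gives the restarted node time to rejoin the cluster before
    the next node is taken down.
    """
    restarted = []
    for broker in brokers:
        stop(broker)          # take this node down
        start(broker)         # bring it back up
        sleep(delay_seconds)  # wait before touching the next node
        restarted.append(broker)
    return restarted
```

Passing delay_seconds=0 reproduces the back-to-back restart that seems to trigger the hang; the default of 10 matches the delay that appeared to help.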

Thanks much,

Matt

On Tue, Mar 13, 2012 at 3:00 PM, Matt Pietrek <mpietrek at skytap.com> wrote:

> Some other work came up so I needed to drop this thread for a few weeks.
> However, coming back to it, I can easily reproduce this issue within one or
> two tries.
>
> In a nutshell, in a clustered environment, simply stop one node, wait a
> few seconds, then restart it. The last output seen is:
>
> starting database
>                                ...
>
> I've let it wait for much longer than 30 seconds and it has never come
> back.
>
> Any chance this may have been stamped out in RabbitMQ 2.8?
>
>
>
> On Fri, Feb 24, 2012 at 1:43 PM, Matt Pietrek <mpietrek at skytap.com> wrote:
>
>> | So how long are you waiting when determining it's hanging? Less than 30
>> seconds?
>>
>> Just to be double sure, I let it sit for an hour yesterday. I would have
>> expected a timeout, but it never came.
>>
>> It's a pretty easy scenario to script and try out. I'd send you my code,
>> but it relies on other internal commands.
>>
>> There may also be a timing issue. If I put a 10 second delay after
>> restarting one broker, and before stopping the next, it seems to help.
>>
>> That is:
>>
>> for x in broker_list:
>>     stop x
>>     start x
>>     sleep(10)
>>
>> Matt
>>
>>
>> On Fri, Feb 24, 2012 at 4:22 AM, Simon MacMullen <simon at rabbitmq.com> wrote:
>>
>>> On 23/02/12 21:00, Matt Pietrek wrote:
>>>
>>>> The nohup.out on the failing node ends with:
>>>>
>>>
>>> <snip>
>>>
>>>> starting database
>>>> ...
>>>>
>>>
>>> So how long are you waiting when determining it's hanging? Less than 30
>>> seconds?
>>>
>>> Because that looks like Rabbit is waiting for another cluster node (if
>>> it was not the last to shut down, but is the first to start up, it will
>>> wait for the one that was the last to shut down). But it will only wait for
>>> 30 seconds before spitting out an error. I'm not sure how else you could
>>> get it to stop there *without* any further output though.
>>>
>>>
>>> Cheers, Simon
>>>
>>> --
>>> Simon MacMullen
>>> RabbitMQ, VMware
>>>
>>
>>
>