[rabbitmq-discuss] Hang on "starting database ...." remains in 2.8.2 cluster

Matt Pietrek mpietrek at skytap.com
Wed May 9 00:38:51 BST 2012


Hi Francesco,

I run rabbitmq on 3 separate Ubuntu 10.04 64-bit VMs. Clustering is
enabled via the rabbitmq config file, which lists all three hosts (call
them A, B, and C).

I start up all the VMs concurrently (via Capistrano) and verify that
the cluster is running as expected. I then go through this sequence:

--------
# On host A:
rabbitmqctl -n rabbit@A stop
nohup $RABBITMQ_SCRIPT_DIR/rabbitmq-server &
rabbitmqctl wait $PIDFILE

# On host B:
rabbitmqctl -n rabbit@B stop
nohup $RABBITMQ_SCRIPT_DIR/rabbitmq-server &
rabbitmqctl wait $PIDFILE

# On host C:
rabbitmqctl -n rabbit@C stop
nohup $RABBITMQ_SCRIPT_DIR/rabbitmq-server &
rabbitmqctl wait $PIDFILE
--------

The idea is to bring down one server at a time while still retaining
two in the cluster.

During one of the start operations (it's not consistent from run to
run), rabbitmq-server will not finish starting up. The last line in
that node's nohup.dat file is:

"starting database   ....."

FWIW, it might be helpful to put the shutdown/startup commands in a
script that you can loop over repeatedly so as to try the whole
sequence numerous times. We use Capistrano here to execute actions on
remote machines, but you can probably use SSH to get the same effect
from a script file.
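
A minimal sketch of such a loop, in case it helps. It is not the script
we actually run. Assumptions: the hosts are reachable over SSH under the
same names used for the nodes (A, B, C), RABBITMQ_SCRIPT_DIR and PIDFILE
are set in the remote shell's environment, the console output goes to a
file called nohup.out, and coreutils `timeout` is available locally. The
names HOSTS, ROUNDS, WAIT_SECS and RUN are made up for this example. By
default it only prints the commands (RUN=echo).

```shell
#!/bin/sh
# Rolling-restart loop (sketch). Restarts each node in turn, many times,
# and stops at the first node that does not come back within WAIT_SECS.

HOSTS="${HOSTS:-A B C}"          # assumed: SSH host name == node name
ROUNDS="${ROUNDS:-20}"           # how many times to repeat the sequence
WAIT_SECS="${WAIT_SECS:-120}"    # give up on "rabbitmqctl wait" after this
RUN="${RUN:-echo}"               # dry run by default; set RUN=ssh to execute

restart_node() {
    host="$1"
    $RUN "$host" "rabbitmqctl -n rabbit@$host stop" || return 1
    # Single quotes: the variables are expanded on the remote host.
    # Redirecting all three streams lets ssh return once the server
    # has been put in the background.
    $RUN "$host" 'nohup $RABBITMQ_SCRIPT_DIR/rabbitmq-server >nohup.out 2>&1 </dev/null &' || return 1
    # "rabbitmqctl wait" blocks forever when the node hangs on startup,
    # so bound it with a local timeout (exit status 124 on expiry).
    timeout "$WAIT_SECS" $RUN "$host" 'rabbitmqctl wait $PIDFILE'
}

main() {
    round=1
    while [ "$round" -le "$ROUNDS" ]; do
        for host in $HOSTS; do
            if ! restart_node "$host"; then
                echo "round $round: rabbit@$host did not come back within ${WAIT_SECS}s" >&2
                return 1
            fi
        done
        round=$((round + 1))
    done
    echo "completed $ROUNDS rounds without a hang"
}

main
```

Saved as, say, restart-loop.sh (a hypothetical name), running it as is
prints the command sequence; something like
`RUN=ssh ROUNDS=50 sh restart-loop.sh` would execute it for real. When
the timeout fires, only the local ssh is killed, so the hung node is
left in place for inspection.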

Let me know if you have other questions about our setup,

Matt


On Tue, May 8, 2012 at 3:52 AM, Francesco Mazzoli
<francesco at rabbitmq.com> wrote:
> Hi Matt,
>
> Predictably, I can't reproduce this. Since you say that it'll happen
> "inevitably" (while, if I understand correctly, in your previous messages
> it was tricky to reproduce), can you send us more information about your
> setup and the steps to trigger the problem?
>
> Francesco.
>
>
> On 04/05/12 23:55, Matt Pietrek wrote:
>>
>> I've written to this alias before about this topic, and the problem
>> remains in 2.8.2. See:
>>
>> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-February/018414.html
>>
>> I have a three-node cluster running RabbitMQ 2.8.2/Erlang R13B03 on Ubuntu
>> 10.04.
>>
>> Once the cluster is up and running properly (as observed by the Web
>> UI), I then start/stop individual nodes in the cluster:
>>     rabbitmqctl stop
>>     rabbitmq-server
>>
>> Inevitably, one of the nodes won't come back up, waiting forever on
>> "starting database" (there is no 30-second timeout; it waits forever).
>>
>> The only way to get all three nodes functioning again together is to
>> forcibly stop the other two nodes, then restart them all again.
>>
>>
>> The first item below is the console output as captured via nohup,
>> showing "starting database" as the last item.
>> The second item below is the last few lines of the rabbit@<node>.log
>> file, showing the node shutting down, then beginning to start up
>> again.
>>
>> Is it likely that a newer Erlang version would help out?
>> What else can I provide to help diagnose this?
>>
>> Thanks,
>>
>> Matt
>>
>> --------
>> node           : rabbit@util
>> app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.8.2/sbin/../ebin/rabbit.app
>> home dir       : /home/mpietrek
>> config file(s) : /home/mpietrek/work/var/run/rabbitmq.config
>> cookie hash    : pR5H9kY3Wra/XdLELT5hgQ==
>> log            : /home/mpietrek/work/logs/util.mpietrek.internal.illumita.com/rabbit@util.log
>> sasl log       : /home/mpietrek/work/logs/util.mpietrek.internal.illumita.com/rabbit@util-sasl.log
>> database dir   : /home/mpietrek/work/var/lib/rabbit@util
>> erlang version : 5.7.4
>>
>> -- rabbit boot start
>> starting file handle cache server
>> ...done
>> starting worker pool
>>  ...done
>> starting database                                                     ...
>>
>> --------
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>>     application: rabbitmq_management_agent
>>     exited: stopped
>>     type: permanent
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>> stopped TCP Listener on 0.0.0.0:5672
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>>     application: rabbit
>>     exited: stopped
>>     type: permanent
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>>     application: os_mon
>>     exited: stopped
>>     type: permanent
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>>     application: mnesia
>>     exited: stopped
>>     type: permanent
>>
>> =INFO REPORT==== 4-May-2012::15:02:14 ===
>> Halting Erlang VM
>>
>> =INFO REPORT==== 4-May-2012::15:02:52 ===
>> Limiting to approx 924 file handles (829 sockets)
>
>

