[rabbitmq-discuss] cluster node "stuck" during start

Not Drew Stevens not.drew.stevens at gmail.com
Fri Jul 25 20:34:59 BST 2014


On more than a few occasions, when a RabbitMQ cluster node has come back
up after a server reboot, the RabbitMQ server on that node has failed to
finish starting.

The condition persisted even after the rabbit processes were killed and
RabbitMQ was restarted manually.

The only way to get the server to start was a node reset (or explicit
deletion of the mnesia database).

Are there any suggestions about how to handle this without losing the
state of the node?

The system process list looked like this:

# ps aux | grep rabbit
rabbitmq  1005  0.0  0.0   9888  2788 ?        S    Jun13   1:01
/usr/lib/erlang/erts-5.10.2/bin/epmd -daemon
root     15746  0.0  0.0  11232  1708 pts/3    S+   23:26   0:00 /bin/sh
/etc/init.d/rabbitmq-server start
root     15797  0.0  0.0  11036  1468 pts/3    S+   23:26   0:00 /bin/sh
/usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
rabbitmq 15799  0.0  0.0  11036  1424 ?        S    23:26   0:00 /bin/sh
/usr/sbin/rabbitmq-server
rabbitmq 15807  3.1  1.2 599876 47728 ?        Sl   23:26   0:03
/usr/lib/erlang/erts-5.10.2/bin/beam -W w -K true -A30 -P 1048576 --
-root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../ebin -noshell
-noinput -s rabbit boot -sname rabbit@my-rmq-server -boot start_sasl
-config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options
[{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false
-rabbit error_logger {file,"/var/log/rabbitmq/rabbit@my-rmq-server.log"}
-rabbit sasl_error_logger
{file,"/var/log/rabbitmq/rabbit@my-rmq-server-sasl.log"} -rabbit
enabled_plugins_file "/etc/rabbitmq/enabled_plugins" rabbit plugins_dir
"/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../plugins" -rabbit
plugins_expand_dir
"/var/lib/rabbitmq/mnesia/rabbit@my-rmq-server-plugins-expand" -os_mon
start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup
false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@my-rmq-server"
rabbitmq 15814  0.0  0.0  94432  2636 pts/3    S+   23:26   0:00 su
rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "wait"
"/var/run/rabbitmq/pid"
rabbitmq 15819  0.2  0.3 106624 14008 pts/3    Sl+  23:26   0:00
/usr/lib/erlang/erts-5.10.2/bin/beam -- -root /usr/lib/erlang -progname
erl -- -home /var/lib/rabbitmq -- -pa
/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.1/sbin/../ebin -noshell
-noinput -hidden -sname rabbitmqctl15819 -boot start_clean -s
rabbit_control_main -nodename rabbit@my-rmq-server -extra wait
/var/run/rabbitmq/pid
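
In the process list above, the init script is blocked in "rabbitmqctl wait
/var/run/rabbitmq/pid", which has no deadline of its own here. A minimal
sketch, in Python, of bounding that wait so a stuck start is reported
instead of hanging forever. The function name and the 60-second value are
assumptions for illustration, not anything the init script does:

```python
import subprocess

def wait_with_timeout(cmd, timeout_s):
    """Run cmd; return True only if it exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising on timeout.
        return False
    return result.returncode == 0

# Example call (command taken from the process list; 60 s is an arbitrary choice):
# started = wait_with_timeout(["rabbitmqctl", "wait", "/var/run/rabbitmq/pid"], 60)
```

A False result only says that the wait did not complete in time. It does
not say why the broker is stuck.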

This RabbitMQ node showed as an "up" node in the Nodes list in the
management console of another node in the cluster.

Also, rabbitmqctl status still returned results on the stuck node:


# rabbitmqctl status
Status of node 'rabbit@my-rmq-server' ...
[{pid,1114},
 {running_applications,
     [{os_mon,"CPO  CXC 138 46","2.2.12"},
      {inets,"INETS  CXC 138 49","5.9.5"},
      {mnesia,"MNESIA  CXC 138 12","4.9"},
      {amqp_client,"RabbitMQ AMQP Client","3.2.1"},
      {rabbitmq_auth_mechanism_ssl,
          "RabbitMQ SSL authentication (SASL EXTERNAL)","3.2.1"},
      {xmerl,"XML parser","1.3.3"},
      {eldap,"Ldap api","1.0.1"},
      {rfc4627_jsonrpc,"JSON RPC Service","3.2.1-git5e67120"},
      {sasl,"SASL  CXC 138 11","2.3.2"},
      {stdlib,"ERTS  CXC 138 10","1.19.2"},
      {kernel,"ERTS  CXC 138 10","2.16.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R16B01 (erts-5.10.2) [source-bdf5300] [64-bit] [smp:2:2]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,44596672},
      {connection_procs,2808},
      {queue_procs,0},
      {plugins,8464},
      {other_proc,15751480},
      {mnesia,1191152},
      {mgmt_db,0},
      {msg_index,0},
      {other_ets,1235896},
      {binary,716136},
      {code,20445199},
      {atom,711569},
      {other_system,4533968}]},
 {file_descriptors,
     [{total_limit,924},{total_used,0},{sockets_limit,829},{sockets_used,0}]},
 {processes,[{limit,1048576},{used,105}]},
 {run_queue,0},
 {uptime,271}]
...done.
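
One thing that stands out in the status output above is that the
running_applications list has no entry for the rabbit application itself,
which I would expect a fully started broker to report. A rough Python
sketch for checking that from captured "rabbitmqctl status" text. The
function names are made up for illustration, and the regular expressions
assume the output is laid out as shown above:

```python
import re

def running_applications(status_text):
    """Return the application names listed under running_applications."""
    # Capture everything between "{running_applications," and the "]}" that closes the list.
    match = re.search(r"\{running_applications,\s*\[(.*?)\]\}", status_text, re.DOTALL)
    if match is None:
        return []
    # Each entry looks like {name,"description","version"}; keep only the name atom.
    return re.findall(r"\{([a-z][a-zA-Z0-9_]*),\s*\"", match.group(1))

def broker_app_started(status_text):
    """True if the rabbit application itself appears in the list."""
    return "rabbit" in running_applications(status_text)
```

For the output pasted above, broker_app_started would return False, since
only os_mon, inets, mnesia, the plugins and the base Erlang applications
are listed.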

The startup log and the RabbitMQ log showed that the node had begun to
boot but never finished:

# cat startup_log

              RabbitMQ 3.2.1. Copyright (C) 2007-2013 GoPivotal, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit@my-rmq-server.log
  ######  ##        /var/log/rabbitmq/rabbit@my-rmq-server-sasl.log
  ##########
              Starting broker...


# cat rabbit@my-rmq-server.log

=INFO REPORT==== 25-Jul-2014::17:18:21 ===
Starting RabbitMQ 3.2.1 on Erlang R16B01
Copyright (C) 2007-2013 GoPivotal, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 25-Jul-2014::17:18:21 ===
node           : rabbit@my-rmq-server
home dir       : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : WmWI9mzuXn9u47LQDipY3g==
log            : /var/log/rabbitmq/rabbit@my-rmq-server.log
sasl log       : /var/log/rabbitmq/rabbit@my-rmq-server-sasl.log
database dir   : /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server

=INFO REPORT==== 25-Jul-2014::17:18:23 ===
Limiting to approx 924 file handles (829 sockets)


A few minutes then passed without any further activity in either the
logs or the files in the mnesia database directory:

# date
Fri Jul 25 17:23:56 UTC 2014


# ls -lt /var/lib/rabbitmq/mnesia/rabbit@my-rmq-server
total 148
-rw-r--r--   1 rabbitmq rabbitmq   271 Jul 25 17:21 DECISION_TAB.LOG
-rw-r--r--   1 rabbitmq rabbitmq   102 Jul 25 17:21 LATEST.LOG
-rw-r--r--   1 rabbitmq rabbitmq   171 Jul 25 17:18
nodes_running_at_shutdown
-rw-r--r--   1 rabbitmq rabbitmq   317 Jul 25 17:18 cluster_nodes.config
-rw-r--r--   1 rabbitmq rabbitmq   137 Jul 25 17:18 rabbit_vhost.DCD
-rw-r--r--   1 rabbitmq rabbitmq   640 Jul 25 17:18 rabbit_user.DCD
-rw-r--r--   1 rabbitmq rabbitmq 10207 Jul 25 17:18
rabbit_runtime_parameters.DCD
-rw-r--r--   1 rabbitmq rabbitmq 20423 Jul 25 17:18 rabbit_durable_route.DCD
-rw-r--r--   1 rabbitmq rabbitmq 21020 Jul 25 17:18 rabbit_durable_queue.DCD
-rw-r--r--   1 rabbitmq rabbitmq  2724 Jul 25 17:18
rabbit_durable_exchange.DCD
-rw-r--r--   1 rabbitmq rabbitmq   850 Jul 25 17:18
rabbit_user_permission.DCD
drwxr-xr-x   2 rabbitmq rabbitmq  4096 Jul 25 17:16 msg_store_transient
drwxr-xr-x   2 rabbitmq rabbitmq  4096 Jul 25 17:16 msg_store_persistent
drwxr-xr-x 170 rabbitmq rabbitmq 12288 Jul 25 17:16 queues
-rw-r--r--   1 rabbitmq rabbitmq 28983 Jul 24 23:35 schema.DAT
-rw-r--r--   1 rabbitmq rabbitmq     3 Jun 13 09:41 rabbit_serial
-rw-r--r--   1 rabbitmq rabbitmq   238 Jun 13 09:41 schema_version
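
The listing above is how I judged that the node had stalled: the newest
file (17:21) was already a few minutes older than the output of date
(17:23:56). The same check can be sketched in Python. The function name is
an assumption, and it only looks at entries directly inside the directory,
like the ls above:

```python
import os
import time

def seconds_since_last_change(directory, now=None):
    """Age in seconds of the most recently modified entry directly in directory.

    Returns None if the directory is empty.
    """
    now = time.time() if now is None else now
    with os.scandir(directory) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries]
    if not mtimes:
        return None
    return now - max(mtimes)

# Example call (path from the listing above):
# age = seconds_since_last_change("/var/lib/rabbitmq/mnesia/rabbit@my-rmq-server")
```

How large an age should count as "stuck" is a judgment call. A node that
is waiting on other cluster members may legitimately be quiet for a while.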










