[rabbitmq-discuss] rabbitmq 2.6.1 cluster failure recovery

Simon MacMullen simon at rabbitmq.com
Mon Oct 3 12:26:55 BST 2011


Hi Alain.

When you see timeout_waiting_for_tables, that should mean that the node 
you're trying to start:

* Could not find any other cluster nodes running

* Was not the last node to shut down
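
As a rough mental model only (this is not RabbitMQ's actual code, and the function and parameter names are made up for illustration), the two conditions above combine like this in Python:

```python
def expect_table_timeout(peers_running: bool, was_last_to_stop: bool) -> bool:
    """Illustrative model of when timeout_waiting_for_tables is expected.

    peers_running:    the starting node found at least one other cluster node up
    was_last_to_stop: the starting node was the last cluster node to shut down
    """
    # Both conditions must hold: no peer was found running AND this node
    # was not the last one to shut down.
    return (not peers_running) and (not was_last_to_stop)
```

Under this model, a node that can see a running peer should not hit the timeout, which is why the question of reachability matters here.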

From your explanation it sounds like node-1 *is* running while you 
restart node-2 - is that correct? In that case, can node-2 definitely 
see node-1? (i.e. it can ping cumulonimbus)
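
One way to go a step beyond a plain ping is to test whether node-2 can resolve node-1's hostname and open a TCP connection to its Erlang port mapper (epmd). The sketch below assumes epmd listens on its default port 4369 and that Python is available on node-2; `can_reach` is a hypothetical helper, not part of RabbitMQ.

```python
import socket

EPMD_PORT = 4369  # default epmd port; an assumption about this setup


def can_reach(host: str, port: int = EPMD_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False
```

Calling `can_reach("cumulonimbus")` on node-2 and getting False would suggest a name-resolution or firewall problem rather than a RabbitMQ one. A True result only shows that epmd is reachable; it does not prove that the Erlang distribution port itself is open.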

Cheers, Simon

On 01/10/11 01:25, Alain Dazzi wrote:
> Hi,
>
> I can't get my rabbitmq cluster to recover from a dead node. So
> perhaps someone can help ...
>
> node-1 (cumulonimbus)
> Linux cumulonimbus 2.6.38-11-server #50-Ubuntu SMP Mon Sep 12 21:34:27
> UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
> ii  rabbitmq-server                       2.6.1-1
> root@cumulonimbus:~# ls -1 /usr/lib/rabbitmq/lib/rabbitmq_server-2.6.1/plugins/
> amqp_client-2.6.1.ez
> mochiweb-1.3-rmq2.6.1-git9a53dbd.ez
> rabbitmq_management-2.6.1.ez
> rabbitmq_management_agent-2.6.1.ez
> rabbitmq_management_visualiser-2.6.1.ez
> rabbitmq_mochiweb-2.6.1.ez
> README
> webmachine-1.7.0-rmq2.6.1-hg0c4b60a.ez
>
>
> node-2 (nuage-informatique)
> Linux nuage-informatique 2.6.38-11-generic #50-Ubuntu SMP Mon Sep 12
> 21:17:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
> ii  rabbitmq-server                       2.6.1-1
>
> 1/ stop both servers and set up the same .erlang.cookie value; restart nodes
>
> 2/ on node-1 I create a cluster
>      rabbitmqctl stop_app
>      rabbitmqctl reset
>      rabbitmqctl cluster rabbit@nuage-informatique rabbit@cumulonimbus
> Clustering node rabbit@cumulonimbus with ['rabbit@nuage-informatique',
>                                           rabbit@cumulonimbus] ...
> ...done.
>
> 3/ This creates 2 disc nodes !!!
>
> 4/ run a test and pass data successfully
>
> 5/ restart node-2 (service rabbitmq-server stop)
> service rabbitmq-server start ... fails with ...
> root@nuage-informatique:~/Desktop# service rabbitmq-server start
> Starting rabbitmq-server: FAILED - check /var/log/rabbitmq/startup_{log, _err}
> rabbitmq-server.
> Erlang has closed
> Crash dump was written to: erl_crash.dump
> Kernel pid terminated (application_controller)
> ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
>
> Activating RabbitMQ plugins ...
> 1 plugins activated:
> * rabbitmq_management_agent-2.6.1
>
>
> +---+   +---+
> |   |   |   |
> |   |   |   |
> |   |   |   |
> |   +---+   +-------+
> |                   |
> | RabbitMQ  +---+   |
> |           |   |   |
> |   v2.6.1  +---+   |
> |                   |
> +-------------------+
> AMQP 0-9-1 / 0-9 / 0-8
> Copyright (C) 2007-2011 VMware, Inc.
> Licensed under the MPL.  See http://www.rabbitmq.com/
>
> node           : rabbit@nuage-informatique
> app descriptor : /usr/lib/rabbitmq/lib/rabbitmq_server-2.6.1/sbin/../ebin/rabbit.app
> home dir       : /var/lib/rabbitmq
> config file(s) : (none)
> cookie hash    : qHpvLciGsi5o4f8ScVzyWg==
> log            : /var/log/rabbitmq/rabbit@nuage-informatique.log
> sasl log       : /var/log/rabbitmq/rabbit@nuage-informatique-sasl.log
> database dir   : /var/lib/rabbitmq/mnesia/rabbit@nuage-informatique
> erlang version : 5.7.4
>
> -- rabbit boot start
> starting file handle cache server                                     ...done
> starting worker pool                                                  ...done
> starting database
> ...BOOT ERROR: FAILED
> Reason: {error,
>              {timeout_waiting_for_tables,
>                  [rabbit_user,rabbit_user_permission,rabbit_vhost,
>                   rabbit_durable_route,rabbit_durable_exchange,
>                   rabbit_durable_queue]}}
> Stacktrace: [{rabbit_mnesia,wait_for_tables,1},
>               {rabbit_mnesia,check_schema_integrity,0},
>               {rabbit_mnesia,ensure_schema_integrity,0},
>               {rabbit_mnesia,init,0},
>               {rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
>               {rabbit,run_boot_step,1},
>               {rabbit,'-start/2-lc$^0/1-0-',1},
>               {rabbit,start,2}]
> {"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}}"}
>
> At this point I have to re-install node-2 to recover.
>
> Any idea why?
>
> Thank you,
>
> Next I would like to test mirrored queues, but obviously this has to work first.
>
> -Alain


-- 
Simon MacMullen
RabbitMQ, VMware

