[rabbitmq-discuss] Disc node clustering

Fri Jan 27 12:03:43 GMT 2012

Hello.

I am running into a problem when clustering disc nodes.
Two brokers are located on two hosts.
Steps to reproduce:

1. Cluster the *second* machine *with the first* as a disc node
(RABBITMQ_NODENAME=wosnfs).
[root@epbyminw2482t3 ~]# rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl cluster wosnfs@`hostname -s` wosnfs@epbyminw2482t2 && rabbitmqctl start_app

2. Remove the first node from the cluster.
[root@epbyminw2482t2]# rabbitmqctl stop_app && rabbitmqctl cluster wosnfs@`hostname -s` && rabbitmqctl reset && rabbitmqctl start_app

3. Restart the rabbitmq-server service on the second node.
[root@epbyminw2482t3 ~]# service rabbitmq-server restart
Restarting rabbitmq-server: FAILED - check /var/log/rabbitmq/startup_{log, _err}
rabbitmq-server.

[root@epbyminw2482t3 ~]# cat /var/log/rabbitmq/startup_err
Erlang has closed

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
[root@epbyminw2482t3 ~]# cat /var/log/rabbitmq/startup_log
Activating RabbitMQ plugins ...
0 plugins activated:

/<-CUT->/
erlang version : 5.8.4

-- rabbit boot start
starting file handle cache server                                     ...done
starting worker pool                                                  ...done
starting database                                                     ...BOOT ERROR: FAILED
Reason: {error,
             {unable_to_join_cluster,
                 [wosnfs@epbyminw2482t3,wosnfs@epbyminw2482t2],
                 {merge_schema_failed,
                     "Bad cookie in table definition mirrored_sup_childspec: wosnfs@epbyminw2482t3 = {cstruct,mirrored_sup_childspec,ordered_set,[wosnfs@epbyminw2482t3],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],{{1327,663993,999525},*wosnfs@epbyminw2482t2*},{{3,1},{wosnfs@epbyminw2482t3,{1327,664433,471064}}}}, *wosnfs@epbyminw2482t2* = {cstruct,mirrored_sup_childspec,ordered_set,[*wosnfs@epbyminw2482t2*],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],{{1327,664434,761441},*wosnfs@epbyminw2482t2*},{{2,0},[]}}\n"}}}
Stacktrace: [{rabbit_mnesia,init_db,3},
              {rabbit_mnesia,init,0},
              {rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
              {rabbit,run_boot_step,1},
              {rabbit,'-start/2-lc$^0/1-0-',1},
              {rabbit,start,2},
              {application_master,start_it_old,4}]
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}}"}

All commands finished successfully except the last one.
Could you help me find out what this error means and why it occurs?
It seems that the first node (epbyminw2482t2) has already been removed from
the cluster, so why does some information about it remain in Mnesia on
the other node and appear in the error log? I would expect that correctly
removing a node from a cluster should not affect the other nodes.
The problem is reproducible with an arbitrary number of disc nodes.
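
For anyone comparing logs from their own setup, here is a minimal sketch of how the node list reported by the unable_to_join_cluster error could be pulled out of the startup log. The function name nodes_in_join_error is hypothetical and not part of RabbitMQ. It assumes the node list sits on the line right after the error atom, as in the log above, that node names have the form name@host, and that grep supports the -A, -E and -o options (GNU grep does).

```shell
#!/bin/sh
# Hypothetical helper (not part of RabbitMQ): print each node name found in
# the node list of an unable_to_join_cluster error, one per line, sorted.
# Assumption: the node list is on the line directly after the error atom.
nodes_in_join_error() {
    grep -A1 'unable_to_join_cluster' "$1" \
        | grep -Eo '[A-Za-z0-9_]+@[A-Za-z0-9_.-]+' \
        | sort -u
}
```

It would be called as `nodes_in_join_error /var/log/rabbitmq/startup_log`. A node that was supposedly removed but still shows up in the output is the one the remaining broker is still trying to contact at boot.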

It is interesting that if we change the order of joining the cluster (join
the first node to the second), no error appears.
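
To make the two join orders explicit, here is a small sketch that only prints the command sequence for each case and does not run anything. The helper print_join_cmds is hypothetical. The commands follow the rabbitmqctl syntax used in step 1, and the reversed-order sequence is an assumption: it is step 1 with the two hosts swapped.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands that join the local node
# to a peer as a disc node, using the rabbitmqctl syntax from step 1.
# Listing the local node in the cluster command makes it a disc node.
print_join_cmds() {
    node_name=$1    # value of RABBITMQ_NODENAME, e.g. wosnfs
    local_host=$2   # short host name of the machine being joined
    peer_host=$3    # short host name of the node it joins
    printf '%s\n' \
        "rabbitmqctl stop_app" \
        "rabbitmqctl reset" \
        "rabbitmqctl cluster ${node_name}@${local_host} ${node_name}@${peer_host}" \
        "rabbitmqctl start_app"
}

# Order from step 1 (restart fails after the removal): second joins first.
print_join_cmds wosnfs epbyminw2482t3 epbyminw2482t2
# Reversed order (no error observed): first joins second.
print_join_cmds wosnfs epbyminw2482t2 epbyminw2482t3
```

Each sequence is meant to be run on the host given as the second argument.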

Environment:
CentOS 6.0
Erlang R14B03
RabbitMQ 2.7.1

--
Best regards,
Artsiom
