[rabbitmq-discuss] unable to join the cluster node ..

Tim Watson watson.timothy at gmail.com
Wed Oct 23 10:50:39 BST 2013


Hi,

First off, sorry it took so long to get to this - your question managed to slip through the cracks somehow.

On 15 Oct 2013, at 16:58, sagu prf <sagu.prf1 at gmail.com> wrote:

> while stopping the RAM node, which was acting as master, it was unable to
> fail over to the disc node, which had been alive longer than the RAM node.
> Surprisingly, this disc node did not take over as master (even though its
> uptime was higher than the RAM node's).

It's not abundantly clear to me what has actually gone wrong here. What exactly does your cluster look like, and which commands were applied to which nodes in order to produce the errors below? Did these error logs appear on the node that was stopped (or removed from the cluster), at which point did they appear, and how did you detect that the operation had failed? Did the command return any additional error information or messages?
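
For reference, the sort of information that would help (this is only a sketch - adjust the node names to your setup, and the exact output format varies a little between releases):

rabbitmqctl -n rabbit@linuxserv04 cluster_status
rabbitmqctl -n rabbit@linuxserv05 cluster_status

Run against each node, that shows which nodes each member thinks belong to the cluster, which are disc vs ram nodes, and which are currently running - comparing those views across nodes is usually the quickest way to spot where they disagree.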

> 
> =ERROR REPORT==== 14-Oct-2013::10:57:01 ===
> Mnesia('rabbit at linuxserv05'): ** ERROR ** (core dumped to file:
> "/var/lib/rabbitmq/MnesiaCore.rabbit at linuxserv05_1381_809421_158007")
> ** FATAL ** Failed to merge schema: Bad cookie in table definition
> rabbit_user_permission: 'rabbit at linuxserv05' =
> {cstruct,rabbit_user_permission,set,['rabbit at linuxserv01'],['rabbit at linuxserv06','rabbit at linuxserv05','rabbit at linuxserv04']
> ,[],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1376,164084,168369},'rabbit at linuxserv04'},
> {{4,2},{'rabbit at linuxserv05',{1381,784616,463441}}}},
> 'rabbit at linuxserv01' =
> {cstruct,rabbit_user_permission,set,[],['rabbit at linuxserv01'],
> [],0,read_write,false,[],[],false,user_permission,[user_vhost,permission],[],[],[],{{1381,794520,133541},'rabbit at linuxserv01'},{{2,0},[]}}
> 

What version of rabbit are you running? And what version of erlang? Are all nodes running the same versions of both?
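
If you're not sure, something along these lines on each node should tell you (just a sketch - the fields printed vary a little between releases):

rabbitmqctl status
# the running_applications list includes the rabbit version, e.g. {rabbit,"RabbitMQ","x.y.z"};
# more recent releases also print an erlang_version entry

erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'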

> 
> =ERROR REPORT==== 14-Oct-2013::10:57:11 ===
> ** Generic server mnesia_monitor terminating
> ** Last message in was {'EXIT',<0.46.0>,killed}
> ** When Server state == {state,<0.46.0>,[],[],true,[],undefined,[]}
> ** Reason for termination ==
> ** killed
> 
> =ERROR REPORT==== 14-Oct-2013::10:57:11 ===
> ** Generic server mnesia_recover terminating
> ** Last message in was {'EXIT',<0.46.0>,killed}
> ** When Server state == {state,<0.46.0>,undefined,undefined,undefined,0,false,
>                               true,[]}
> ** Reason for termination ==
> ** killed
> 
> =ERROR REPORT==== 14-Oct-2013::10:57:11 ===
> ** Generic server mnesia_subscr terminating
> ** Last message in was {'EXIT',<0.46.0>,killed}
> ** When Server state == {state,<0.46.0>,20502}
> ** Reason for termination ==
> ** killed
> 
> =INFO REPORT==== 14-Oct-2013::10:57:11 ===
>    application: mnesia
>    exited: {shutdown,{mnesia_sup,start,[normal,[]]}}
>    type: permanent
> 
> =ERROR REPORT==== 14-Oct-2013::10:57:11 ===
> ** Generic server mnesia_snmp_sup terminating
> ** Last message in was {'EXIT',<0.46.0>,killed}
> ** When Server state == {state,
>                            {local,mnesia_snmp_sup},
>                            simple_one_for_one,
>                            [{child,undefined,mnesia_snmp_sup,
>                                 {mnesia_snmp_hook,start,[]},
>                                 transient,3000,worker,
>                                 [mnesia_snmp_sup,mnesia_snmp_hook,
>                                  supervisor]}],
>                            undefined,0,86400000,[],mnesia_snmp_sup,[]}
> ** Reason for termination ==
> ** killed
> 
> I have tried the following:
> rabbit at linuxserv04% rabbitmqctl -n rabbit at linuxserv01 stop_app

At which point did you do that?

> 
> rabbit at linuxserv04% rabbitmqctl -n rabbit at linuxserv01 reset
> it lost all the queue information ..

That is what reset does... it returns the node to a virgin state, removing all of its data (queues included) and any record of cluster membership.

> rabbit at linuxserv04%
> it is failing to join the cluster ..
> 
> linuxserv05% rabbitmq-server -detached failed to join.
> 
> even when I tried it manually, it failed ..
> 

Seems like the cluster nodes have a different view of the world. Did you experience any network partitions during this time?
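
If you want to check, a partition normally shows up in the logs as Mnesia reporting an {inconsistent_database, running_partitioned_network, ...} event, and on 3.1 or later it is also visible directly:

rabbitmqctl cluster_status
# a non-empty {partitions, ...} entry means a partition was detected

For the record, the usual sequence for (re)joining a node on a 3.x broker is roughly the following (on 2.x the command is 'cluster' rather than 'join_cluster'); take this as a sketch rather than a recipe for your particular situation:

rabbitmqctl -n rabbit@linuxserv05 stop_app
rabbitmqctl -n rabbit@linuxserv05 reset
# reset wipes the node's stale metadata - and, as above, all of its local data
rabbitmqctl -n rabbit@linuxserv05 join_cluster rabbit@linuxserv04
rabbitmqctl -n rabbit@linuxserv05 start_app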

> is there any other method to remove this linuxserv01 from the cluster and
> bring another node (linuxserv05) back into the cluster, without shutting
> down the entire cluster?

I don't think so. It looks to me (and perhaps one of the other developers is better versed in this and will correct me) as though your nodes are very confused about cluster membership.

You could've tried starting the failing node with RABBITMQ_NODE_ONLY=1 and then running leave_cluster or some such, but now that you've reset it I don't think that will work.
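
To make that a bit more concrete (only a sketch, and the exact commands depend on which rabbit version you're running), the idea would have been something like:

RABBITMQ_NODE_ONLY=1 rabbitmq-server -detached
# starts the Erlang node on the broken machine without booting the rabbit application

followed by the appropriate removal command from there. On 3.0 or later there is also forget_cluster_node, which is run from a healthy member and works on the surviving nodes' view of the cluster rather than on the dead node itself:

rabbitmqctl -n rabbit@linuxserv04 forget_cluster_node rabbit@linuxserv01

That might be worth checking for in your version, though given how confused the nodes' metadata looks I can't promise it will untangle things.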

Tim

