[rabbitmq-discuss] Mnesia crash after RabbitMQ node machine restart (clustered)

Simon MacMullen simon at rabbitmq.com
Tue Jan 24 11:10:47 GMT 2012


We could probably do with better error reporting / handling in this 
case, but I think this is what happened (I certainly saw the same 
errors when I went through the following steps myself)...

* Build a cluster with node1 = disc and node2 = ram (see the sketch below)
* Stop both nodes
* Start node2 *first*
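
For reference, a disc + ram pair like that is typically created with 
the 2.x-style "rabbitmqctl cluster" command - roughly the following, 
run on node2, with node names assumed rather than taken from your setup:

    rem on node2, with the server running on both machines
    rabbitmqctl stop_app
    rem listing only node1 makes node2 a ram member of the cluster
    rabbitmqctl cluster rabbit@node1
    rabbitmqctl start_app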

Started on its own like this, node2 has no disc node to catch up from, 
so it initialises a fresh copy of the mnesia database in RAM and starts 
up by itself.

* Start node1

At this point node1 knows that it needs to cluster with node2, but node2 
doesn't agree, and has a different version of the database anyway. Hence 
the error message.

I was able to recover from this by stopping node2 again, then starting 
node1 first and node2 second.
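
In command terms that was roughly the following - a sketch only, 
assuming the default Windows service name and that rabbitmqctl is on 
the path on each machine:

    rem on node2 (ram): stop the whole node, not just the application
    net stop RabbitMQ

    rem on node1 (disc): start it and check that it is up
    net start RabbitMQ
    rabbitmqctl status

    rem back on node2: start it and confirm both nodes are clustered again
    net start RabbitMQ
    rabbitmqctl cluster_status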

I'll file a bug to make the errors clearer in this case, but for the 
time being you should make sure to always bring at least one disc node 
up first in any cluster.
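
If the machines restart unattended, as in your weekly reboot, one 
option is to disable automatic start of the RabbitMQ service on the ram 
node and use a small script that waits for the disc node first. A rough 
sketch only - the node and service names are assumptions, and it is 
worth checking the exit-code behaviour of rabbitmqctl.bat on your 
version:

    rem start-ram-node.bat - run on MACHINE2; MACHINE1 is the disc node
    :wait_for_disc_node
    rabbitmqctl -n rabbit@MACHINE1 status >nul 2>&1
    if errorlevel 1 (
        rem disc node not reachable yet - wait roughly 5 seconds and retry
        ping -n 6 127.0.0.1 >nul
        goto wait_for_disc_node
    )
    net start RabbitMQ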

In your case you should consider switching to a cluster with two disc 
nodes anyway - having only one disc node is a single point of failure 
(SPOF).
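
On 2.x the node type follows from whether a node lists itself in the 
cluster command, so - as far as I recall, treat this as a sketch and 
check the clustering guide - turning MACHINE2 into a disc node would 
look like this, run on MACHINE2:

    rabbitmqctl stop_app
    rem including the local node in the list makes it a disc node
    rabbitmqctl cluster rabbit@MACHINE1 rabbit@MACHINE2
    rabbitmqctl start_app
    rabbitmqctl cluster_status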

Cheers, Simon

On 24/01/12 09:32, LuCo wrote:
> Hello.
>
> We have a RabbitMQ cluster across 2 machines. Machine 1 is created as
> a disc node, and Machine 2 as a memory node.
>
> These machines are restarted every weekend. This has not been a
> problem over the last 5 weeks (since we started trialling RabbitMQ),
> and during this time the nodes have been little used. However, the
> weekend just passed produced the error below on the disc node, just
> after the machine on which that node runs was restarted:
>
> "
> =ERROR REPORT==== 22-Jan-2012::01:17:37 ===
> Mnesia('rabbit@MACHINE1'): ** ERROR ** (core dumped to file:
> "c:/Documents and Settings/user/Application Data/RabbitMQ/
> MnesiaCore.rabbit@MACHINE1_1327_195058_860060")
>   ** FATAL ** Failed to merge schema: Bad cookie in table definition
> mirrored_sup_childspec: 'rabbit@MACHINE1' =
> {cstruct,mirrored_sup_childspec,ordered_set,
> ['rabbit@MACHINE2','rabbit@MACHINE1'],[],[],0,read_write,false,[],
> [],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],
> {{1324,33984,878002},'rabbit@MACHINE1'},{{5,0},{'rabbit@MACHINE2',
> {1324,34466,491790}}}}, 'rabbit@MACHINE2' =
> {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@MACHINE2'],[],[],
> 0,read_write,false,[],[],false,mirrored_sup_childspec,
> [key,mirroring_pid,childspec],[],[],
> {{1327,194914,615072},'rabbit@MACHINE2'},{{2,0},[]}}
> =ERROR REPORT==== 22-Jan-2012::01:17:44 ===
> ** Generic server mnesia_subscr terminating
> ** Last message in was {'EXIT',<0.51.0>,killed}
> ** When Server state == {state,<0.51.0>,57361}
> ** Reason for termination ==
> ** killed
> =ERROR REPORT==== 22-Jan-2012::01:17:44 ===
> ** Generic server mnesia_monitor terminating
> ** Last message in was {'EXIT',<0.51.0>,killed}
> ** When Server state == {state,<0.51.0>,[],[],true,[],undefined,[]}
> ** Reason for termination ==
> ** killed
> =ERROR REPORT==== 22-Jan-2012::01:17:44 ===
> ** Generic server mnesia_recover terminating
> ** Last message in was {'EXIT',<0.51.0>,killed}
> ** When Server state == {state,<0.51.0>,undefined,undefined,undefined,
> 0,false,
>                                 true,[]}
> ** Reason for termination ==
> ** killed
> =ERROR REPORT==== 22-Jan-2012::01:17:44 ===
> ** Generic server mnesia_snmp_sup terminating
> ** Last message in was {'EXIT',<0.51.0>,killed}
> ** When Server state == {state,
>                              {local,mnesia_snmp_sup},
>                              simple_one_for_one,
>                              [{child,undefined,mnesia_snmp_sup,
>                                   {mnesia_snmp_hook,start,[]},
>                                   transient,3000,worker,
>                                   [mnesia_snmp_sup,mnesia_snmp_hook,
>                                    supervisor]}],
>                              undefined,0,86400000,[],mnesia_snmp_sup,
> []}
> ** Reason for termination ==
> ** killed
> =INFO REPORT==== 22-Jan-2012::01:17:44 ===
>      application: mnesia
>      exited: {shutdown,{mnesia_sup,start,[normal,[]]}}
>      type: permanent
> "
>
> The memory node then had this error in the log just after the machine
> on which the node runs was restarted:
> "
> =INFO REPORT==== 22-Jan-2012::01:12:08 ===
> node 'rabbit@G1SVR2-IIS' lost 'rabbit'
> =INFO REPORT==== 22-Jan-2012::01:12:08 ===
> Statistics database started.
> =INFO REPORT==== 22-Jan-2012::01:15:13 ===
> Limiting to approx 924 file handles (829 sockets)
> =INFO REPORT==== 22-Jan-2012::01:15:14 ===
>      application: mnesia
>      exited: stopped
>      type: permanent
> =INFO REPORT==== 22-Jan-2012::01:15:14 ===
> Memory limit set to 818MB of 2047MB total.
> ...<log continues with default initialisation of the node>...
> "
>
> As mentioned, the nodes have been seldom used and contained only 2
> durable queues. This crash resulted in the nodes reverting to their
> default configuration (previously configured users were lost).
>
> The times on the machines above are the same, so I am a little
> confused by the messages on MACHINE2 (the memory node), which seems
> to have crashed before MACHINE1?
>
> I can understand why this has possibly happened (one node up/one node
> down when attempting to cluster on restart), but why has it not
> happened on the previous 5 restarts? What actually happens on a Rabbit
> restart (following a server restart) in a cluster scenario? Do I need
> a custom start-up script to cover all bases?
>
> Any thoughts?
>
> Daniel
>
> _______________________________________________
> rabbitmq-discuss mailing list
> rabbitmq-discuss at lists.rabbitmq.com
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss


-- 
Simon MacMullen
RabbitMQ, VMware

