[rabbitmq-discuss] Mnesia crash after RabbitMQ node machine restart (clustered)

Tue Jan 24 09:32:54 GMT 2012

Hello.

We have a RabbitMQ cluster spanning two machines: MACHINE1 is
configured as a disc node and MACHINE2 as a memory (RAM) node.

Both machines are restarted every weekend. This has not been a problem
for the last 5 weeks (since we began trialling RabbitMQ), during which
the nodes have been little used. However, after the restart this past
weekend, the disc node logged the error below just after its machine
came back up:

"
=ERROR REPORT==== 22-Jan-2012::01:17:37 ===
Mnesia('rabbit@MACHINE1'): ** ERROR ** (core dumped to file: "c:/
Documents and Settings/user/Application Data/RabbitMQ/
MnesiaCore.rabbit@MACHINE1_1327_195058_860060")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition
mirrored_sup_childspec: 'rabbit@MACHINE1' =
{cstruct,mirrored_sup_childspec,ordered_set,
['rabbit@MACHINE2','rabbit@MACHINE1'],[],[],0,read_write,false,[],
[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],
{{1324,33984,878002},'rabbit@MACHINE1'},{{5,0},{'rabbit@MACHINE2',
{1324,34466,491790}}}}, 'rabbit@MACHINE2' =
{cstruct,mirrored_sup_childspec,ordered_set,['rabbit@MACHINE2'],[],[],
0,read_write,false,[],[],false,mirrored_sup_childspec,
[key,mirroring_pid,childspec],[],[],
{{1327,194914,615072},'rabbit@MACHINE2'},{{2,0},[]}}
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,57361}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,[],[],true,[],undefined,[]}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,undefined,undefined,undefined,
0,false,
                               true,[]}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],mnesia_snmp_sup,
[]}
** Reason for termination ==
** killed
=INFO REPORT==== 22-Jan-2012::01:17:44 ===
    application: mnesia
    exited: {shutdown,{mnesia_sup,start,[normal,[]]}}
    type: permanent
"
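To help read the "Bad cookie" line, here is a rough sketch that decodes
the two table cookies in that message. It assumes (I have not confirmed
this against the Mnesia source) that the first element of each cookie
is an Erlang now()-style {MegaSecs, Secs, MicroSecs} tuple, i.e. the
moment that copy of the table definition was created. The function name
is my own.

```python
from datetime import datetime, timezone


def now_tuple_to_utc(mega, secs, micro):
    """Convert an Erlang now()-style {MegaSecs, Secs, MicroSecs} tuple to UTC."""
    whole_seconds = mega * 1_000_000 + secs
    stamp = datetime.fromtimestamp(whole_seconds, tz=timezone.utc)
    return stamp.replace(microsecond=micro)


# Cookie timestamps copied from the two cstruct records in the error above.
cookies = {
    "rabbit@MACHINE1": (1324, 33984, 878002),
    "rabbit@MACHINE2": (1327, 194914, 615072),
}

for node, cookie in cookies.items():
    print(node, now_tuple_to_utc(*cookie).isoformat())
# rabbit@MACHINE1 2011-12-16T11:13:04.878002+00:00
# rabbit@MACHINE2 2012-01-22T01:15:14.615072+00:00
```

If that assumption holds, MACHINE1 still has the table definition from
when the cluster was first set up, while the copy on MACHINE2 was
created in the early hours of 22 January.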

The memory node logged the following just after its machine was
restarted:
"
=INFO REPORT==== 22-Jan-2012::01:12:08 ===
node 'rabbit@G1SVR2-IIS' lost 'rabbit'
=INFO REPORT==== 22-Jan-2012::01:12:08 ===
Statistics database started.
=INFO REPORT==== 22-Jan-2012::01:15:13 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 22-Jan-2012::01:15:14 ===
    application: mnesia
    exited: stopped
    type: permanent
=INFO REPORT==== 22-Jan-2012::01:15:14 ===
Memory limit set to 818MB of 2047MB total.
...<log continues with default initialisation of the node>...
"

As mentioned, the nodes have seldom been used and held only 2 durable
queues. After the crash, the nodes came back up with their default
configuration, and the previously configured users were lost.

The clocks on the two machines are in sync, so I am a little confused
by the log entries on MACHINE2 (the memory node), which seems to have
crashed before MACHINE1.

I can see how this might have happened (one node up and the other down
when they attempted to cluster on restart), but why did it not happen
on the previous 5 restarts? What actually happens when Rabbit restarts
(following a server restart) in a clustered setup? Do I need a custom
start-up script to cover all bases?
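
To make that last question concrete, this is roughly the kind of
start-up helper I have in mind: the memory node would wait until the
disc node's machine answers on a TCP port before starting its own
RabbitMQ service. It is only a sketch. The function name and the
timeout values are made up, and a successful connection only shows that
something is listening on that port, not that Mnesia on the disc node
is ready.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=2.0):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns True once a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

On MACHINE2 the script would call something like
wait_for_port("MACHINE1", 4369) (4369 being the default epmd port) and
only start the local RabbitMQ service if it returns True.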

Any thoughts?

Daniel