[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.

Mark Ward ward.mark at gmail.com
Tue Oct 30 18:07:12 GMT 2012


Hi Simon,

I am sorry that my post appears all over the place. I am trying to work out 
a clear understanding of a cluster so I can advise what not to do to avoid 
data loss. It is becoming clear that server maintenance must account for 
the status of the queues in the cluster, and great care must be taken to 
maintain the cluster. I have read the unsynchronised-slaves section; the 
cluster buster testing purposely breaks the cluster in exactly the ways the 
queue synchronisation documentation says are not handled. By doing this I 
am learning in more detail how the cluster actually works.

After looking at the section again, this statement stands out: "... ensure 
that your use of messaging generally results in very short or empty queues 
that rapidly drain."  The cluster buster test scenarios I have been using 
assume a queue that is not empty and is not rapidly drained.

It makes us nervous that if each node in the cluster is restarted one at a 
time and allowed to rejoin the cluster, then all data in the idle 
persistent queues would be lost. Performing Windows updates might be fatal 
to idle queued data!  To clarify, the messages are published with the 
persistent flag set.
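
In case it is useful, here is a minimal sketch of the kind of publish we 
do, using the Erlang AMQP client (the client choice, queue name, and helper 
are illustrative, not our actual code). Durability needs both a durable 
queue and delivery_mode = 2 on each message:

    %% Illustrative helper: declare a durable queue and publish a
    %% persistent message to it with the Erlang AMQP client.
    -include_lib("amqp_client/include/amqp_client.hrl").

    publish_persistent(Channel, Queue, Payload) ->
        %% durable = true makes the queue itself survive a broker restart
        #'queue.declare_ok'{} =
            amqp_channel:call(Channel,
                              #'queue.declare'{queue = Queue,
                                               durable = true}),
        %% delivery_mode = 2 marks the message as persistent
        Props = #'P_basic'{delivery_mode = 2},
        amqp_channel:cast(Channel,
                          #'basic.publish'{routing_key = Queue},
                          #amqp_msg{props = Props, payload = Payload}).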

Back to the cluster test scenario....

I guess the conclusion is that the cluster performed as designed, and it 
did require all nodes in the cluster to be started within the 30-second 
timeout.  Although, I am not sure whether the erl_crash.dump is expected.

When all 3 servers were booted things went as expected, but with an 
erl_crash.dump.  Each server performed the 30-second search and then shut 
down, writing a crash dump.  After I noticed all 3 services were not 
running, I quickly restarted each service within the 30 seconds, and the 
cluster came up with no queue data lost.
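
For anyone repeating the test, the recovery amounts to restarting every 
node's Windows service inside the same 30-second window and then checking 
membership. A rough sketch, assuming the default service name "RabbitMQ":

    :: run on each node, close enough together that all nodes
    :: are back inside the 30-second window
    net stop RabbitMQ
    net start RabbitMQ

    :: then, on any node, confirm that every node rejoined
    rabbitmqctl cluster_status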

Each server had a corresponding log:

=INFO REPORT==== 30-Oct-2012::08:36:36 ===
Limiting to approx 924 file handles (829 sockets)

=ERROR REPORT==== 30-Oct-2012::08:37:13 ===
Timeout contacting cluster nodes. Since RabbitMQ was shut down forcefully
it cannot determine which nodes are timing out. Details on all nodes will
follow.

DIAGNOSTICS
===========

nodes in question: ['rabbit at RIOBARON-1','rabbit at CUST1-MASTER']

hosts, their running nodes and ports:
- CUST1-MASTER: [{rabbit,55021}]
- RIOBARON-1: []

current node details:
- node name: 'rabbit at RIOOVERLORD-1'
- home dir: C:\Windows
- cookie hash: MXZdkdzg76BGNNu+ev94Ow==



=INFO REPORT==== 30-Oct-2012::08:37:14 ===
    application: rabbit
    exited: {bad_return,{{rabbit,start,[normal,[]]},
                         {'EXIT',{rabbit,failure_during_boot}}}}
    type: permanent

Then, in the -sasl.log there is the following entry, plus the 
erl_crash.dump file.

=CRASH REPORT==== 30-Oct-2012::08:35:40 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.133.0>
    registered_name: []
    exception exit: {bad_return,{{rabbit,start,[normal,[]]},
                                 {'EXIT',{rabbit,failure_during_boot}}}}
      in function  application_master:init/4 (application_master.erl, line 138)
    ancestors: [<0.132.0>]
    messages: [{'EXIT',<0.134.0>,normal}]
    links: [<0.132.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 1597
    stack_size: 24
    reductions: 132
  neighbours:


I do not know whether the sasl.log entry has anything to do with the 
crash.dump file output, but I have one.

=erl_crash_dump:0.1
Tue Oct 30 08:37:17 2012
Slogan: Kernel pid terminated (application_controller) 
({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
System version: Erlang R15B02 (erts-5.9.2) [async-threads:30]
Compiled: Mon Sep  3 11:00:33 2012
Taints: 
Atoms: 21679
=memory
total: 16514240
processes: 4423098
processes_used: 4423098
system: 12091142
atom: 495069
atom_used: 481297
binary: 21864
code: 9403948
ets: 27188
=hash_table:atom_tab
size: 19289
used: 13066
objs: 21679
depth: 8
............................


On Tue, Oct 30, 2012 at 11:49 AM, Simon MacMullen <simon at rabbitmq.com> 
wrote:
I am not sure quite what you are saying. You say that when you started the 
nodes again, none of them successfully started? And there was "an error". 
But then you started them "quickly" and that worked?

When each node is started, it decides whether it thinks there are any other 
nodes which were running when it was killed. If so, it waits 30 seconds for 
them to become available, and if nothing appears it gives an error about 
"timeout waiting for tables".

Was this the error you saw?

We might make this 30 seconds configurable in future, but we need to think 
of the other case (where people start one node and not the other, and don't 
realise anything is wrong until the timeout).
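
For illustration, that wait is essentially Mnesia's wait_for_tables/2 with 
a fixed 30-second timeout. A minimal sketch (the table list here is made up 
for the example, not RabbitMQ's real schema):

    %% Illustrative only: wait up to 30 seconds for the cluster's
    %% Mnesia tables, then fail in the way described above.
    wait_for_cluster_tables() ->
        Tables = [rabbit_queue, rabbit_exchange],  %% hypothetical names
        case mnesia:wait_for_tables(Tables, 30000) of
            ok                 -> ok;
            {timeout, BadTabs} -> {error, {timeout_waiting_for_tables,
                                           BadTabs}};
            {error, Reason}    -> {error, Reason}
        end.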

You should also read:
http://www.rabbitmq.com/ha.html#unsynchronised-slaves

Cheers, Simon
