[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.
Mark Ward
ward.mark at gmail.com
Tue Oct 30 18:07:12 GMT 2012
Hi Simon,
I am sorry that my post appears all over the place. I am trying to work out
a clear understanding of a cluster so I can advise what not to do to avoid
data loss. It is becoming clear that server maintenance must account for the
status of the queues in the cluster, and that great care must be applied to
maintain the cluster. I have read the unsynchronised-slaves section, and the
point of the cluster-buster testing is to purposely break the cluster in the
ways the documentation says queue synchronization does not handle. By doing
this I am learning in more detail how the cluster actually works.
After looking at the section again, this statement stands out: "... ensure
that your use of messaging generally results in very short or empty queues
that rapidly drain." My cluster-buster test scenarios deliberately use a
queue that is not empty and is not rapidly drained.
It makes us nervous that if each node in the cluster is restarted one at a
time and allowed to rejoin the cluster, all data in the idle persistent
queues could be lost. Performing Windows updates might be fatal to idle
queued data! To clarify, the messages are published with the persistent
flag set.
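To be concrete about "persistent": the publishers declare a durable queue
and mark each message persistent (delivery_mode=2). A minimal sketch with
the pika client; the host and queue name are made up for illustration:

    # Minimal sketch, not our actual publisher: a durable queue plus
    # persistent messages, which is what "persistent" means above.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # The queue itself must be durable, or persistent messages are still
    # lost when the node hosting the queue goes down.
    channel.queue_declare(queue="idle_queue", durable=True)

    channel.basic_publish(
        exchange="",
        routing_key="idle_queue",
        body=b"payload",
        properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent
    )
    connection.close()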
Back to the cluster test scenario....
I guess the conclusion is that the cluster performed as designed: it did
require all nodes in the cluster to be started within the 30-second timeout.
Although I am not sure if the erl_crash.dump is expected.
When all 3 servers were booted, things went as expected except for the
erl_crash.dump: each server performed its 30-second search for the other
nodes and then shut down, writing a crash dump. After I noticed all 3
services were not running, I quickly started each service within the
30-second window and the cluster came up with no queue data lost.
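If that is the expected behaviour, our maintenance scripts will need to
start all three services close enough together that each node boots inside
the others' 30-second window. A rough sketch of what I have in mind (the
default Windows service name "RabbitMQ", remote sc.exe access, and the
hostnames from the logs below are my assumptions):

    # Rough sketch: start the RabbitMQ Windows service on all three hosts
    # nearly in parallel, so each node boots within the others' 30-second
    # window. Assumes remote admin rights and the default service name.
    import subprocess
    import threading

    HOSTS = ["RIOOVERLORD-1", "RIOBARON-1", "CUST1-MASTER"]

    def start_service(host):
        # sc.exe can target a remote machine: sc \\host start ServiceName
        subprocess.run(["sc", r"\\" + host, "start", "RabbitMQ"], check=False)

    threads = [threading.Thread(target=start_service, args=(h,)) for h in HOSTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()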
Each server had the equivalent entries in its own log:
=INFO REPORT==== 30-Oct-2012::08:36:36 ===
Limiting to approx 924 file handles (829 sockets)
=ERROR REPORT==== 30-Oct-2012::08:37:13 ===
Timeout contacting cluster nodes. Since RabbitMQ was shut down forcefully
it cannot determine which nodes are timing out. Details on all nodes will
follow.
DIAGNOSTICS
===========
nodes in question: ['rabbit at RIOBARON-1','rabbit at CUST1-MASTER']
hosts, their running nodes and ports:
- CUST1-MASTER: [{rabbit,55021}]
- RIOBARON-1: []
current node details:
- node name: 'rabbit at RIOOVERLORD-1'
- home dir: C:\Windows
- cookie hash: MXZdkdzg76BGNNu+ev94Ow==
=INFO REPORT==== 30-Oct-2012::08:37:14 ===
application: rabbit
exited: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
type: permanent
Then, in the -sasl.log, there is the following entry, plus the
erl_crash.dump file:
=CRASH REPORT==== 30-Oct-2012::08:35:40 ===
crasher:
initial call: application_master:init/4
pid: <0.133.0>
registered_name: []
exception exit: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
in function application_master:init/4 (application_master.erl, line 138)
ancestors: [<0.132.0>]
messages: [{'EXIT',<0.134.0>,normal}]
links: [<0.132.0>,<0.7.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 1597
stack_size: 24
reductions: 132
neighbours:
I do not know whether the sasl.log entry is related to the crash dump
output, but I do have an erl_crash.dump file:
=erl_crash_dump:0.1
Tue Oct 30 08:37:17 2012
Slogan: Kernel pid terminated (application_controller)
({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
System version: Erlang R15B02 (erts-5.9.2) [async-threads:30]
Compiled: Mon Sep 3 11:00:33 2012
Taints:
Atoms: 21679
=memory
total: 16514240
processes: 4423098
processes_used: 4423098
system: 12091142
atom: 495069
atom_used: 481297
binary: 21864
code: 9403948
ets: 27188
=hash_table:atom_tab
size: 19289
used: 13066
objs: 21679
depth: 8
............................
On Tue, Oct 30, 2012 at 11:49 AM, Simon MacMullen <simon at rabbitmq.com>
wrote:
I am not sure quite what you are saying. You say that when you started the
nodes again, none of them successfully started? And there was "an error".
But then you started them "quickly" and that worked?
When each node is started, it decides whether it thinks there are any other
nodes which were running when it was killed. If so, it waits 30 seconds for
them to become available, and if nothing appears it gives an error about
"timeout waiting for tables".
Was this the error you saw?
We might make this 30 seconds configurable in future, but we need to think
of the other case (where people start one node and not the other, and don't
realise anything is wrong until the timeout).
You should also read:
http://www.rabbitmq.com/ha.html#unsynchronised-slaves
Cheers, Simon