[rabbitmq-discuss] Cluster busting - shut off all nodes at the same time.

Simon MacMullen simon at rabbitmq.com
Wed Oct 31 10:51:58 GMT 2012


On 30/10/12 18:07, Mark Ward wrote:
> After looking at the section again, this statement stands out: "... ensure
> that your use of messaging generally results in very short or empty
> queues that rapidly drain."  In the cluster-buster test scenarios I have
> been using a setup in which a queue is not empty and does not drain
> rapidly.
>
> It makes us nervous that if each node in the cluster is restarted one at
> a time and allowed to rejoin the cluster, all data in the idle
> persistent queues would be lost. Performing Windows updates might be
> fatal to idle queued data!  To clarify, the messages are published
> with the persistent flag set.

Yes.
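
(For context, "published with persistent on" to a durable queue corresponds
roughly to the sketch below. Python/pika is used purely for illustration;
the queue name and payload are not taken from this thread.)

    import pika

    # Declare a durable queue and publish a message marked persistent
    # (delivery_mode=2), matching the setup described above.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="idle_queue", durable=True)
    ch.basic_publish(
        exchange="",
        routing_key="idle_queue",
        body=b"payload",
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )
    conn.close()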

We have various enhancements to queue synchronisation planned; the 
current situation is clearly not ideal when queues drain slowly.

One thing you'll be able to do in 3.0 (not that this is at all a perfect 
solution) is switch off mirroring for queues, do any cluster 
maintenance, and then switch mirroring back on. Another non-ideal 
alternative is to take the whole cluster down, do your Windows updates, 
and then start it up again.
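
With 3.0's policy-based mirroring, switching mirroring off and back on
around maintenance could look roughly like the sketch below, driven through
the management plugin's HTTP API. The port, credentials, policy name and
pattern here are illustrative assumptions, not part of this thread:

    import requests

    # Assumed: management plugin enabled, default guest credentials,
    # and an "ha-all" policy on the default vhost ("/", encoded as %2F).
    policy_url = "http://localhost:15672/api/policies/%2F/ha-all"
    auth = ("guest", "guest")

    # Switch mirroring off before maintenance...
    requests.delete(policy_url, auth=auth).raise_for_status()

    # ... stop nodes, apply Windows updates, restart the cluster ...

    # ...then switch mirroring back on afterwards.
    requests.put(policy_url, auth=auth, json={
        "pattern": ".*",
        "definition": {"ha-mode": "all"},
    }).raise_for_status()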

> Back to the cluster test scenario....
>
> I guess the conclusion is that the cluster performed as designed, and it
> did require all nodes in the cluster to be started within the 30-second
> timeout.  However, I am not sure whether the erl_crash.dump is expected.

Yeah, erl_crash.dump is emitted whenever RabbitMQ fails to start. The 
log file entries below are also consistent with this happening.

Cheers, Simon

> When all 3 servers were booted things went as expected, apart from the
> erl_crash.dump.  Each server performed the 30-second search, then shut
> down and wrote a crash dump.  After I noticed all 3 services were not
> running, I quickly started each service within the 30 seconds and the
> cluster came up; no queue data was lost.
>
> Each server had corresponding entries in its respective log:
>
> =INFO REPORT==== 30-Oct-2012::08:36:36 ===
> Limiting to approx 924 file handles (829 sockets)
>
> =ERROR REPORT==== 30-Oct-2012::08:37:13 ===
> Timeout contacting cluster nodes. Since RabbitMQ was shut down forcefully
> it cannot determine which nodes are timing out. Details on all nodes will
> follow.
>
> DIAGNOSTICS
> ===========
>
> nodes in question: ['rabbit@RIOBARON-1','rabbit@CUST1-MASTER']
>
> hosts, their running nodes and ports:
> - CUST1-MASTER: [{rabbit,55021}]
> - RIOBARON-1: []
>
> current node details:
> - node name: 'rabbit@RIOOVERLORD-1'
> - home dir: C:\Windows
> - cookie hash: MXZdkdzg76BGNNu+ev94Ow==
>
>
>
> =INFO REPORT==== 30-Oct-2012::08:37:14 ===
>      application: rabbit
>      exited: {bad_return,{{rabbit,start,[normal,[]]},
>                           {'EXIT',{rabbit,failure_during_boot}}}}
>      type: permanent
>
> Then, in the -sasl.log there is the following entry, plus the
> "erl_crash.dump" file.
>
> =CRASH REPORT==== 30-Oct-2012::08:35:40 ===
>    crasher:
>      initial call: application_master:init/4
>      pid: <0.133.0>
>      registered_name: []
>      exception exit: {bad_return,{{rabbit,start,[normal,[]]},
>                                   {'EXIT',{rabbit,failure_during_boot}}}}
>        in function  application_master:init/4 (application_master.erl,
> line 138)
>      ancestors: [<0.132.0>]
>      messages: [{'EXIT',<0.134.0>,normal}]
>      links: [<0.132.0>,<0.7.0>]
>      dictionary: []
>      trap_exit: true
>      status: running
>      heap_size: 1597
>      stack_size: 24
>      reductions: 132
>    neighbours:
>
>
> I do not know whether the sasl.log entry has anything to do with the
> crash.dump file output, but I have one.
>
> =erl_crash_dump:0.1
> Tue Oct 30 08:37:17 2012
> Slogan: Kernel pid terminated (application_controller)
> ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
> System version: Erlang R15B02 (erts-5.9.2) [async-threads:30]
> Compiled: Mon Sep  3 11:00:33 2012
> Taints:
> Atoms: 21679
> =memory
> total: 16514240
> processes: 4423098
> processes_used: 4423098
> system: 12091142
> atom: 495069
> atom_used: 481297
> binary: 21864
> code: 9403948
> ets: 27188
> =hash_table:atom_tab
> size: 19289
> used: 13066
> objs: 21679
> depth: 8
> ............................
>
>
> On Tue, Oct 30, 2012 at 11:49 AM, Simon MacMullen <simon at rabbitmq.com>
> wrote:
> I am not sure quite what you are saying. You say that when you started
> the nodes again, none of them successfully started? And there was "an
> error". But then you started them "quickly" and that worked?
>
> When each node is started it decides whether it thinks there are any
> other nodes which were running when it was killed. If so, it waits 30
> seconds for them to become available, and if nothing appears it gives an
> error about "timeout waiting for tables".
>
> Was this the error you saw?
>
> We might make this 30 seconds configurable in future, but we need to
> think of the other case (where people start one node and not the other,
> and don't realise anything is wrong until the timeout).
>
> You should also read:
> http://www.rabbitmq.com/ha.html#unsynchronised-slaves
>
> Cheers, Simon
>
> On Tuesday, October 30, 2012 9:45:21 AM UTC-5, Mark Ward wrote:
>
>
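
As a follow-up to the 30-second wait described above: one quick way to
confirm that every node has come back after restarting them all is to ask
the management API which nodes are running. This is only a sketch, assuming
the management plugin is enabled and default port/credentials:

    import requests

    # List cluster nodes and whether each one is currently running.
    resp = requests.get("http://localhost:15672/api/nodes",
                        auth=("guest", "guest"))
    resp.raise_for_status()
    for node in resp.json():
        status = "running" if node.get("running") else "NOT running"
        print(node["name"], status)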


-- 
Simon MacMullen
RabbitMQ, VMware

