[rabbitmq-discuss] Cluster nodes stop/start order can lead to failures

Thu Sep 20 20:52:50 BST 2012

Wouldn't having at least 3 nodes help with this? I would imagine that 
the last node left running, even if it's a RAM node, would have the latest 
exchanges, bindings, etc. and would be able to deliver them authoritatively 
to any node coming back online. Obviously, changes to exchanges and bindings 
could not occur while no disc nodes are available, but I can't see why a RAM 
node couldn't help in this scenario. I haven't tested this yet, but it should 
be simple enough to prove with a few AWS cloud instances.
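
For the two-node case quoted below, Matt's observation suggests a simple 
workaround: start the nodes in the reverse of the order they were stopped, 
so the node that went down last comes up first. Here is a minimal sketch of 
that ordering. It assumes a hypothetical helper named start_order (it is not 
part of RabbitMQ or rabbitmqctl) that only prints the order and does not 
start anything:

```shell
#!/bin/sh
# start_order: hypothetical helper, not part of RabbitMQ.
# Arguments: node names in the order they were stopped.
# Prints the same names reversed, i.e. last-stopped first.
# Assumption: the node stopped last has the most recent view of the
# cluster, so it is the one that can boot without waiting for peers.
start_order() {
    reversed=""
    for node in "$@"; do
        # Prepend each node so the final list is in reverse order.
        reversed="$node${reversed:+ $reversed}"
    done
    printf '%s\n' "$reversed"
}

# Example: node1 was stopped first, node2 last.
start_order node1 node2
```

You would then run "sudo service rabbitmq-server start" on each host in the 
printed order. Whether this holds up with 3 or more nodes is something I 
have not verified.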

On Thursday, September 13, 2012 2:57:15 AM UTC-6, Jignesh Purohit wrote:
>
> Hi Matt,
>
> I am facing the same problem too, so kindly let me know if you find a 
> solution.
>
> Regards
> Jignesh Purohit
>
>
> On Thursday, September 13, 2012 2:13:14 AM UTC+5:30, Matt Long wrote:
>>
>> Say I have node1 and node2 both running as disc nodes in a cluster (there 
>> are no other nodes in the cluster). If I stop rabbitmq-server on node1 and 
>> then stop rabbitmq-server on node2, I'm unable to then start 
>> rabbitmq-server again on node1...in particular, the start command hangs for 
>> ~35 seconds before showing FAILED...
>>
>> Is this the expected behavior? Note that starting node2 after having 
>> stopped node1 and then node2 works fine; I'm assuming that is because 
>> node2 was aware that node1 had gone offline before it was stopped itself.
>>
>> The relevant bit from the startup_log on node1 is :
>> BOOT FAILED
>> ===========
>>
>> Timeout contacting cluster nodes: ['rabbit@node2'].
>>
>> Here's all the details:
>>
>> *node1*$ sudo service rabbitmq-server stop
>> Stopping rabbitmq-server: rabbitmq-server.
>>
>> *node2*$ sudo service rabbitmq-server stop
>> Stopping rabbitmq-server: rabbitmq-server.
>>
>> *node1*$ sudo service rabbitmq-server start
>> Starting rabbitmq-server: FAILED - check /var/log/rabbitmq/startup_{log, _err}
>> rabbitmq-server.
>>
>> Contents of startup_err:
>>
>> Crash dump was written to: erl_crash.dump
>> Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})
>>
>> Tail end of startup_log:
>>
>> -- rabbit boot start
>> starting file handle cache server                                     ...done
>> starting worker pool                                                  ...done
>> starting database                                                     ...
>>
>> BOOT FAILED
>> ===========
>>
>> Timeout contacting cluster nodes: ['rabbit@node2'].
>>
>> DIAGNOSTICS
>> ===========
>>
>> nodes in question: ['rabbit@node2']
>>
>> hosts, their running nodes and ports:
>> - node2: []
>>
>> current node details:
>> - node name: 'rabbit@node1'
>> - home dir: /var/lib/rabbitmq
>> - cookie hash: xxxredactedxxxxxxxxxxx==
>>
>>
>> {"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}}"}
>>
>