Man, I feel silly. I don't know how I haven't caught this before, but the box I'm running the tests on has no swap partition (completely disabled); I must have booted the wrong image!

Long story short:

/usr/sbin/rabbitmq-server: line 76: 28477 Killed erl -pa "`dirname $0`/../ebin" ${START_RABBIT} -sname ${NODENAME} -boot start_sasl +W w ${ERL_ARGS} -rabbit tcp_listeners '[{"'${NODE_IP_ADDRESS}'", '${NODE_PORT}'}]' -sasl errlog_type error -kernel error_logger '{file,"'${LOGS}'"}' -sasl sasl_error_logger '{file,"'${SASL_LOGS}'"}' -os_mon start_cpu_sup true -os_mon start_disksup false -os_mon start_memsup true -os_mon start_os_sup false -os_mon memsup_system_only true -os_mon system_memory_high_watermark 0.90 -mnesia dir "\"${MNESIA_DIR}\"" ${CLUSTER_CONFIG} ${RABBIT_ARGS} "$@"

---
Nov 16 04:11:45 ip-10-251-102-223 kernel: Out of Memory: Kill process 28470 (rabbitmq-server) score 1084241 and children.
Nov 16 04:11:45 ip-10-251-102-223 kernel: Out of memory: Killed process 28477 (beam.smp).
Nov 16 04:11:45 ip-10-251-102-223 kernel: oom-killer: gfp_mask=0x201d2, order=0
---

-rw-r--r-- 1 root root    8 Nov 16 04:11 rabbit_persister.LOG
-rw-r--r-- 1 root root 368M Nov 16 04:11 rabbit_persister.LOG.previous

---

Needless to say, there was nothing to recover in the first log file (the process must have been killed while dumping from memory). Renaming the .previous file brought rabbit back online in ~30 seconds. What's interesting is that the persister log is really small; I'm not sure how the process could have run out of memory. All messages sent to it were marked as persistent. (It was running overnight.)
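
For anyone who hits the same thing, the recovery was roughly this (example paths below; MNESIA_DIR will differ per setup):

---
# broker is already dead at this point (OOM-killed); move the big .previous
# log out of the way, keeping it around under a different name just in case,
# then start the broker again
cd /var/lib/rabbitmq/mnesia            # or wherever MNESIA_DIR points on your box
mv rabbit_persister.LOG.previous rabbit_persister.LOG.previous.bak
/etc/init.d/rabbitmq-server start      # or however the broker is normally started
---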

I'll do some testing with swap tomorrow.

ig
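
PS: for the swap testing I'll most likely just throw a scratch swap file on the box rather than repartition; something along these lines (size and path are made up, adjust to taste):

---
dd if=/dev/zero of=/swapfile bs=1M count=2048   # 2G scratch swap file
mkswap /swapfile
swapon /swapfile
swapon -s                                       # confirm it shows up as active
---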