<div dir="ltr">We have a 3-node RabbitMQ cluster consisting of 2 disk nodes and one memory node. &nbsp;(The disk nodes are rabbitmq-00 and rabbitmq-01; the memory node is core-01.)&nbsp;<br><br><div>Queues are durable and mirrored (they show +2 in the control panel, etc.) and show as synchronized:<br><font color="#ff0000"><br></font></div><div><div><font color="#ff0000"># rabbitmqctl list_queues name slave_pids synchronised_slave_pids</font></div><div><font color="#ff0000">Listing queues ...</font></div></div><div><font color="#ff0000">...<br>SVC_mailbox_lookup<span class="Apple-tab-span" style="white-space:pre">        </span>[&lt;'rabbit@rabbitmq-01'.2.301.0&gt;, &lt;'rabbit@core-01'.1.268.0&gt;]<span class="Apple-tab-span" style="white-space:pre">        </span>[&lt;'rabbit@core-01'.1.268.0&gt;, &lt;'rabbit@rabbitmq-01'.2.301.0&gt;]<br></font></div><div><font color="#ff0000">...</font></div><div><font color="#ff0000"><br></font></div><div><font color="#ff0000"># &nbsp;rabbitmqctl list_policies</font></div><div><font color="#ff0000">Listing policies ...</font></div><div><font color="#ff0000">/<span class="Apple-tab-span" style="white-space:pre">        </span>ha-all<span class="Apple-tab-span" style="white-space:pre">        </span>^SVC_<span class="Apple-tab-span" style="white-space:pre">        </span>{"ha-mode":"all"}<span class="Apple-tab-span" style="white-space:pre">        </span>0</font></div><div><font color="#ff0000">...done.</font></div><div><br></div><div><br></div><div>We put in an SSD mounted at '/var/lib/rabbitmq' to host the Mnesia database on rabbitmq-00 and rabbitmq-01. &nbsp;We used only a single drive, figuring that if the disk failed the node would crash and the others in the HA cluster would take over - all clients have been coded for failover.</div><div><br></div><div>The SSD on rabbitmq-00 failed. 
&nbsp; I don't have logs of that event from rabbitmq-00's point of view - for some reason it didn't write out anything.</div><div><br></div><div>I do have it from rabbitmq-01's:<br><font color="#ff0000"><br></font></div><div><font color="#ff0000">=INFO REPORT==== 26-Sep-2013::16:07:20 ===<br>Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Slave &lt;'rabbit@rabbitmq-01'.3.785.0&gt; saw deaths of mirrors &lt;'rabbit@rabbitmq-00'.3.1415.0&gt; <br><br>=INFO REPORT==== 26-Sep-2013::16:07:20 ===<br>Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Promoting slave &lt;'rabbit@rabbitmq-01'.3.785.0&gt; to master<br></font></div><div><br></div><div>But then:<br><font color="#ff0000"><br>=ERROR REPORT==== 26-Sep-2013::16:17:17 ===<br>connection &lt;0.487.0&gt;, channel 1 - soft error:<br>{amqp_error,not_found,<br>            "home node 'rabbit@core-01' of durable queue 'SVC_mailbox_lookup' in vhost '/' is down or inaccessible",<br>            'queue.declare'}</font><br><br></div><div>This is repeated for each queue.&nbsp;</div><div><br></div><div>It looks like rabbitmq-01 took over as master, but then the nodes became non-responsive because they couldn't write to disk on core-01 (the memory node).<br><br><div>We shut down whatever was still running on rabbitmq-00, and everything was still unavailable. &nbsp;We then shut down core-01 and lastly rabbitmq-01, then restarted rabbitmq-01, but it came up with NO queues. &nbsp;&nbsp;</div><div><br></div>Is this an error in the way the HA cluster handles failover, or an error in our configuration - should we not mix memory and disk nodes in an HA cluster? &nbsp;&nbsp;<br><br>I'm trying to figure this out because we want to be sure that if any node in the cluster fails, the others take over seamlessly. &nbsp;Our code does that... 
we just need the cluster to soldier on with no records lost.</div><div><br></div><div>Thanks. &nbsp;</div><div><br></div></div>