<div dir="ltr">Actually, we have many clusters running across three zones in AWS :)<div><br></div><div>But we are prepared to lose entire regions, wholly or partially.</div><div><br></div><div>And we never persist messages in our rabbits; instead we use a multi-region Cassandra cluster, plus S3 for large message bodies.</div>

<div><br></div><div>Plus important messages (anything not individually addressed) are replicated for processing multiple times across multiple regions, racing to resolution.</div><div><br></div><div>It is a 'rabbits everywhere' strategy: a global mesh of redundant cooperating clusters that replicate, route, and resolve messages and use Cassandra and S3 for persistence.</div>

<div><br></div><div>The key to keeping a cluster up across zones in AWS is to never, ever overload it, so that communication between cluster members is never interrupted. The key statistic to monitor is IO wait.</div><div><br></div>
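<div>One way to watch IO wait on a Linux cluster member is to sample the aggregate cpu line of /proc/stat twice and compare the counters. The following is a minimal sketch, not the monitoring setup described above: the function names (parse_cpu_line, iowait_percent, sample_iowait) and the one-second interval are illustrative assumptions, and in production an existing monitoring agent would more likely be used.</div>

```python
import time


def parse_cpu_line(stat_text):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields are summed to get the total elapsed CPU time.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval_seconds=1.0, path="/proc/stat"):
    """Read /proc/stat twice (Linux only) and return the iowait percentage."""
    with open(path) as handle:
        before_text = handle.read()
    time.sleep(interval_seconds)
    with open(path) as handle:
        after_text = handle.read()
    return iowait_percent(before_text, after_text)
```

<div>Calling sample_iowait() periodically and alerting on sustained high values would be one way to act on this; any alert threshold is left to the operator, since none is given here.</div>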

<div>We over-provision our cluster members to be sure they always have enough spare resources. And, as I said, we never persist messages on the cluster.</div><div><br></div><div>ml</div></div><div class="gmail_extra">

<br><br><div class="gmail_quote">On Wed, May 14, 2014 at 4:05 AM, Matthias Radestock <span dir="ltr">&lt;<a href="mailto:matthias@rabbitmq.com" target="_blank">matthias@rabbitmq.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="">On 14/05/14 08:58, Simon MacMullen wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 13/05/2014 18:04, Leonardo N. S. Pereira wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Simon, thanks very much for your answer.<br>

What is the recommended setup for HA running in AWS?<br>

Is there a way to work around the partition problem?<br>

</blockquote>

<br>

Don't cluster across more than two AZs.<br>

<br>

Unless service availability is more important to you than avoiding data<br>

loss, don't cluster across AZs at all.<br>

</blockquote>

<br></div>

Also note that the situation you created in your tests, which causes the odd behaviour, namely partial partitions (where communication between nodes is severed in just one direction), is less likely to occur in practice than full partitions.<span class="HOEnZb"><font color="#888888"><br>


<br>

Matthias.</font></span><div class="HOEnZb"><div class="h5"><br>

_______________________________________________<br>

rabbitmq-discuss mailing list<br>

<a href="mailto:rabbitmq-discuss@lists.rabbitmq.com" target="_blank">rabbitmq-discuss@lists.rabbitmq.com</a><br>

<a href="https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss" target="_blank">https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss</a><br>

</div></div></blockquote></div><br></div>