<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Feb 20, 2014 at 12:11 PM, Michael Oullion <span dir="ltr">&lt;<a href="mailto:michael.oullion@norbert-dentressangle.com" target="_blank">michael.oullion@norbert-dentressangle.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">Thanks Jerry for your quick answer.<br>

What can we do in this situation?<br>

Maybe we can raise the net tick time or use a specific behaviour to handle network splits.<br>

Or simply stop taking snapshots of the VM because it's not necessary?</p></blockquote><div>You may want to think about why you take snapshots in your particular workflow.  As long as you have an occasional snapshot that captures the configuration of your Rabbit nodes as you use them in your dev/test/production environment, you should be fine for restoration purposes.  You probably don't need one every night unless a lot changes on them from day to day.</div>

<div><br></div><div>Besides, if that VM were restored from a snapshot, it would wake up into a world where any connected clients and whatnot are likely gone and forgotten, and it would have to slough such things off anyway.  And there may be messages sitting in queues that were long ago delivered to consumers and acted upon, and that will now come back from the un-snapshotted grave.  If your apps are designed sensibly, favoring idempotency and suitable de-duplication of action at the consumer end, this won't be a big deal, of course.</div>
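<div><br></div><div>To make the de-duplication point concrete, here is a minimal sketch of a consumer that acts on each message at most once.  It assumes every message carries a unique ID set by the publisher; the class name, the callback and the in-memory set are all made up for illustration, and a real app would keep the set of handled IDs in a durable store.</div>

```python
class DedupingConsumer:
    """Runs the action for each message ID at most once (illustrative only)."""

    def __init__(self, action):
        self._action = action   # hypothetical callback performing the side effect
        self._seen = set()      # assumption: in-memory; use a durable store in practice

    def handle(self, message_id, payload):
        """Return True if the action ran, False if this ID was already handled."""
        if message_id in self._seen:
            return False
        self._action(payload)
        # Record the ID only after the action succeeded, so a failure can be retried.
        self._seen.add(message_id)
        return True


shipped = []
consumer = DedupingConsumer(shipped.append)
consumer.handle("order-1", "ship order 1")
consumer.handle("order-2", "ship order 2")
# A node restored from an old snapshot redelivers a message already acted upon:
consumer.handle("order-1", "ship order 1")
```

<div>With something along these lines in place, the resurrected copy of a message is simply ignored instead of triggering the same action twice.</div>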

<div><br></div><div>You may also want to keep an eye on your vSphere monitoring and management stuff to see if anything else is going on around the times these partitions occur.  Partitions are in the eye of each participating beholder, and we detect (really *define*) them via timeout, so anything that renders a node temporarily unable to take part in heartbeating will manifest this way.</div>
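<div><br></div><div>As a rough illustration of why a paused VM looks exactly like a partition, here is a toy timeout-based detector.  It is not the actual Erlang or Rabbit code; the class, the node names and the 60 second timeout are just assumptions for the sketch.</div>

```python
class PeerMonitor:
    """Toy timeout-based failure detector; not Erlang's or RabbitMQ's actual logic."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # assumed timeout, chosen only for this example
        self.last_heard = {}         # peer name -> time of its last heartbeat

    def heartbeat(self, peer, now):
        self.last_heard[peer] = now

    def suspected(self, now):
        """Peers that have been silent for longer than the timeout."""
        return sorted(peer for peer, heard in self.last_heard.items()
                      if now - heard > self.timeout_s)


monitor = PeerMonitor(timeout_s=60)
monitor.heartbeat("rabbit@a", now=0)
monitor.heartbeat("rabbit@b", now=0)
# rabbit@b is frozen (snapshot, vMotion stun, paging) while rabbit@a keeps ticking.
for t in (15, 30, 45, 60, 75, 90):
    monitor.heartbeat("rabbit@a", now=t)

partitioned = monitor.suspected(now=90)   # only the silent peer is listed
```

<div>The observer has no way to tell a frozen peer from an unreachable one: all it sees is silence that outlasts the timeout.</div>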

<div><br></div><div>Beyond snapshotting, which paralyzes the VM for part of the time the snapshot is being made, I'd also watch out for vMotion, which briefly stuns the VM being migrated into a quiescent state just before vSphere switches over to the migrated VM at its new location.  Another possibility is the hypervisor paging memory out from beneath the guest OS that Rabbit is running on, which could make things lag enough that a heartbeat exchange is missed.  That last case can be especially sneaky, since an ESX host under memory pressure may be paging out guest OSes without them, as far as they know, swapping...</div>

<div><br></div><div>Best regards,</div><div>Jerry</div></div></div></div>