[rabbitmq-discuss] RabbitMQ Cluster, split network & VMWare snapshot

Fri Feb 21 01:05:26 GMT 2014

On Thu, Feb 20, 2014 at 12:11 PM, Michael Oullion <
michael.oullion at norbert-dentressangle.com> wrote:

> Thanks Jerry for your quick answer.
> What can we do in this situation?
> Maybe we can raise the net tick time or use a specific behaviour to handle
> network splits.
> Or simply stop taking snapshots of the VM, since they're not necessary?
>
You may want to think about why you take the snapshots in your particular
workflow.  As long as you have an occasional snapshot that captures the
configuration of your Rabbit nodes, as you use them in your
dev/test/production environment, you should be fine for restoration
purposes.  You probably don't need to take one every night unless a lot is
changing on them from day to day.

Besides, if that VM is restored from a snapshot, it will wake up into a
world where any connected clients and whatnot are likely gone and forgotten,
and it will have to slough such things off anyway.  And there may be messages
sitting in queues that were long ago delivered to consumers and acted upon,
that are now going to come back from the un-snapshotted grave.  If your
apps are designed sensibly, favoring idempotency and suitable
de-duplication of action at the consumer end, this won't be a big deal, of
course.
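
To make the de-duplication point concrete, here is a minimal sketch in
Python of a consumer that ignores a message it has already acted on.  The
class name, the message IDs and the in-memory set are all assumptions made
for illustration; a real consumer would take the ID from whatever your
publishers attach to each message and would keep the "seen" record somewhere
durable.

```python
class DedupingConsumer:
    """Hypothetical consumer wrapper: acts on each message ID at most once."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: an in-memory set stands in for a durable store.
        self._seen = set()

    def on_message(self, message_id, body):
        """Run the handler unless this ID was already processed.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeded.
        self._seen.add(message_id)
        return True


processed = []
consumer = DedupingConsumer(processed.append)
consumer.on_message("order-1", "charge card")
consumer.on_message("order-2", "ship parcel")
# A restored node redelivers a message that was handled long ago.
redelivered = consumer.on_message("order-1", "charge card")
```

With something like this in place, a message that comes back after a restore
is recognised and dropped instead of being acted on a second time.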

You may also want to keep an eye on your vSphere monitoring and management
tools to see if anything else is going on around the times these partitions
occur.  Partitions are in the eye of each participating beholder, and we
detect (really *define*) them via timeout, so anything that renders a node
temporarily unable to participate in heartbeating will manifest this way.
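
As a rough illustration of "partition defined by timeout", the Python sketch
below models a node that sends a heartbeat at a fixed interval, except while
it is frozen, and an observer that declares a partition whenever the gap
between heartbeats exceeds a timeout.  The function names and the 15-second
interval and 60-second timeout are assumptions chosen for the example; this
is a simplified model, not the exact tick algorithm Rabbit or Erlang uses.

```python
def heartbeat_times(interval, end, pause_start=None, pause_length=0.0):
    """Times at which a node sends heartbeats, skipping any while it is frozen."""
    times = []
    t = 0.0
    while t <= end:
        frozen = (
            pause_start is not None
            and pause_start <= t < pause_start + pause_length
        )
        if not frozen:
            times.append(t)
        t += interval
    return times


def seen_as_partitioned(times, timeout):
    """True if the observer ever waits longer than `timeout` between heartbeats."""
    return any(later - earlier > timeout for earlier, later in zip(times, times[1:]))


# Illustrative values only: heartbeat every 15 s, timeout of 60 s.
short_stun = heartbeat_times(15.0, 300.0, pause_start=100.0, pause_length=20.0)
long_stun = heartbeat_times(15.0, 300.0, pause_start=100.0, pause_length=90.0)

short_stun_partitioned = seen_as_partitioned(short_stun, timeout=60.0)
long_stun_partitioned = seen_as_partitioned(long_stun, timeout=60.0)
```

In this model the 20-second freeze goes unnoticed, while the 90-second freeze
looks to the observer exactly like a network partition, even though the
network itself never failed.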

Beyond snapshotting, which paralyzes the VM for part of the time the
snapshot is being made, I'd also watch out for two other things.  One is
vMotion, which briefly stuns the VM being migrated into a quiescent state
just before vSphere switches over to the migrated VM at its new location.
The other, possibly, is the hypervisor paging memory out beneath the guest
OS that Rabbit is running on top of, which could make things lag enough
that a heartbeat exchange would be missed.  The latter case can be
especially sneaky, since an ESX host under memory pressure may be paging
out guest memory without the guests, as far as they know, swapping at all...

Best regards,
Jerry