At The Far Reaches of Business Continuity
When halted software causes hours of lost productivity, a robust, automated fault-tolerant response can mean the difference between a good day and a bad one. With this in mind, organizations that rely on the steady throughput of their software-based production processes have to choose their Fault Tolerant high availability (HA) solution carefully.
Come along as we explore Pacemaker and Corosync – components of the HA solution at the heart of the Anvil!, using one particularly relevant HA solution as a familiar analogue: VMware vSphere Fault Tolerance. We’ll make a few stops along the way to talk about some unique aspects of the Anvil! platform as well.
Some Similarities, Important Differences
In vSphere HA, ESXi hosts (blade or otherwise) are assembled into a compute cluster that provides high availability (VM Restart) and Fault Tolerance (VM Recovery). Fault Tolerance is achieved using what’s called a “Live Shadow Instance.” This instance runs on another VMware FT (Fault Tolerance) ESXi host.
On the Anvil!, execution hosts called “subnodes” are assembled into a compute cluster node, which provides Fault Tolerant and highly available VMs. These nodes include Red Hat high availability components in their stack, and this is where Pacemaker and Corosync come in.
Pacemaker
Like VMware FT, Pacemaker provides automatic fault tolerance at both the compute and storage levels. However, instead of relying on a Live Shadow Instance, the VM’s entire execution state is replicated at the time of failover using KVM technology. This eliminates the need to keep another VM instance running at all times. Because the KVM interface comes from Libvirt, other Libvirt goodies such as QEMU emulation capabilities are also available, provided one is comfortable with drawbacks like the additional resource overhead of emulation processes.
Corosync
Corosync provides the consensus protocol interface that keeps clustered services such as VMs in the same state across the hosts in a node. The consensus protocols commonly used circulate a token through a group of hosts connected by a network in a ring formation. Only the host holding the token may broadcast the information required for consensus; it then sends (not broadcasts) the token to the next host in the ring. When a host fails to return the token, automated mitigation can take place and a VM’s state can be recovered, without the need for parallel instruction execution or active-active configurations for clustered workloads.
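To make the token idea concrete, here is a minimal, in-memory sketch of one pass around a ring. It is not Corosync's implementation: the `Host` class, the `circulate_token` function, and the way a dead host is detected (a simple `alive` flag standing in for a token timeout) are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A hypothetical ring member; `alive` stands in for a real token timeout."""
    name: str
    alive: bool = True
    inbox: list = field(default_factory=list)


def circulate_token(ring, pending, rounds=1):
    """Pass a token around `ring` and return the names of hosts that dropped it.

    ring:    list of Host objects in ring order
    pending: dict mapping host name -> list of messages waiting to be broadcast
    """
    failed = []
    for _ in range(rounds):
        for holder in ring:
            if holder.name in failed:
                continue  # already excluded from the ring
            if not holder.alive:
                # The token was sent to this host and never came back.
                # A real cluster would notice via a timeout and start mitigation.
                failed.append(holder.name)
                continue
            # Only the current token holder may broadcast.
            for msg in pending.pop(holder.name, []):
                for peer in ring:
                    if peer.alive:
                        peer.inbox.append((holder.name, msg))
            # The token is then handed to the next host (next loop iteration).
    return failed


ring = [Host("subnode-a"), Host("subnode-b", alive=False), Host("subnode-c")]
failed = circulate_token(
    ring,
    {"subnode-a": ["vm1 started"], "subnode-c": ["vm2 stopped"]},
)
print(failed)
```

In this toy run the middle host never returns the token, so it is reported as failed while the two surviving hosts end up with identical message histories, which is the property the ring is meant to preserve.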
Along with the Libvirtd + KVM virtualization components and DRBD’s replicated storage capabilities as clustered services, the Anvil! provides an industry-proven solution for automating operational recovery in real time. Its capabilities have also been extended to address a unique set of problems worth considering.
The Scancore Advantage
When you spend a lot of time on dedicated edge infrastructure, where minimal latency is desired, you are likely to encounter the long tail of the distribution of probable events. You begin to wonder, “What happens if my uptime threat comes from beyond the hypervisor?” As you might expect, Alteeve has firm footing here: its own Scancore decision engine, used by the Anvil!, provides a unique ability to mitigate the environmental threats to uptime that are common in edge computing. Outside of highly concentrated computing infrastructure such as data centers, redundant services like power and cooling are not necessarily as robust or available. From rising ambient temperatures to mains power failures, Scancore perpetually checks such vital signs. It can be configured to take proactive measures that maintain operations while conditions permit, such as migrating VMs when a power outlet fails. It can also protect continuity when conditions imminently threaten it, for example by gracefully shutting down a VM before battery backup is depleted during a long sitewide power outage, rather than letting the VM suffer a bumpy cold shutdown. When power returns, Scancore restarts automatically and coordinates the careful restart of production workloads in kind.
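The kind of decision described above can be sketched as a small rule function. This is an illustration only, not Scancore's code or configuration: the `Vitals` fields, the `decide` function, the action names, and the threshold values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical snapshot of the readings a monitor might collect."""
    on_mains: bool          # is this host's power feed still live?
    peer_healthy: bool      # is the partner host powered and within limits?
    battery_minutes: float  # estimated UPS runtime remaining
    ambient_c: float        # ambient temperature in degrees Celsius


def decide(v, shutdown_margin_min=10.0, max_ambient_c=35.0):
    """Return one of "continue", "migrate", or "graceful_shutdown".

    The thresholds are illustrative defaults, not values from any real product.
    """
    local_threat = (not v.on_mains) or v.ambient_c > max_ambient_c
    if not local_threat:
        return "continue"
    if v.peer_healthy:
        # Conditions still permit operation elsewhere: move the workload.
        return "migrate"
    if v.battery_minutes <= shutdown_margin_min:
        # Nowhere safe to run and the battery is nearly gone:
        # shut down cleanly instead of losing power mid-write.
        return "graceful_shutdown"
    # No safe peer, but still enough battery to keep running for now.
    return "continue"


print(decide(Vitals(on_mains=False, peer_healthy=True, battery_minutes=60, ambient_c=22)))
print(decide(Vitals(on_mains=False, peer_healthy=False, battery_minutes=5, ambient_c=22)))
```

The ordering of the checks reflects the priority in the text: keep running where possible, migrate while a healthy host exists, and fall back to a graceful shutdown only when a hard power loss is imminent.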
The Recovery Significance
From spare company equipment to alternate operational sites, organizations plan for all kinds of interruptions. When a failed component or a hardware error can threaten your operating margin, the right Fault Tolerant high availability product represents an investment. If you’re wondering what options there are beyond VMware, Alteeve’s Anvil! is a worthy candidate.