Reliability and Right-Sizing

In the world of failure mitigation, a threat to a single component in isolation counts as a lucky break; operations teams tasked with protecting uptime are thoroughly familiar with defensive applications of Murphy's Law. In practice, this means being prepared for situations like a storage problem coinciding with another hardware fault, such as a failed memory module or a local network device. Continuing the previous article's discussion of High Availability architecture, this article looks at what changes when a long uptime record starts to push events from the 'unlikely' end of the scale toward the 'probable' end.

By The Numbers

When weighing the impact of concurrent failures on availability, a storage cluster's architecture serves an organization best when routines like fault-tolerant recovery and workload restarts satisfy the requirements of both business vitality and business continuity. That is, being online and being capable are distinct concerns with some underlying similarities. An application or service that needs 16GB of RAM and its 400GB database to complete its operational cycles without customer pain points is in for a rough ride when the storage it depends on ends up on the other side of a cluster partition. This is precisely the situation the Anvil! is designed to avoid: its architecture packs fault-tolerant, block-level storage into every node it deploys, applying redundancy comprehensively across all technology layers. What follows is an exploration of what that means in more detail.
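
To put rough numbers on the stakes, here is a minimal sketch. It assumes independent failures and illustrative 99.9% per-component availability figures (not Anvil!-specific data), and compares a service that needs two components in series against one with a redundant pair at each layer:

    # Toy availability arithmetic (illustrative figures, independent failures assumed).
    storage_avail = 0.999   # assumed per-component availability
    compute_avail = 0.999

    # A service that requires both components, with no redundancy (series):
    series = storage_avail * compute_avail

    def redundant_pair(a):
        """Availability of a pair of identical components where either can carry the load."""
        return 1 - (1 - a) ** 2

    # The same service with a redundant pair at each layer:
    paired = redundant_pair(storage_avail) * redundant_pair(compute_avail)

    print(f"series (no redundancy):  {series:.6f}")   # ~0.998001
    print(f"redundant at each layer: {paired:.6f}")   # ~0.999998

Under these assumptions, redundancy at each layer turns roughly 17.5 hours of expected annual downtime into about a minute, which is why concurrent failures, not single ones, dominate the design conversation.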

Scalability & Availability

Popular storage clustering solutions typically consist of software-defined logical storage pools spanning multiple physical hosts on a network. To improve efficiency and make it easier to scale, these pools can create logical devices larger than their smallest physical member by aggregating unused portions of space across distributed physical devices. If you need a 50GB logical device, you can assemble one from two 10GB and two 15GB portions of unused space on the physical devices, as in the sketch below. Though resource-efficient, this approach carries tradeoffs that architecture teams grapple with when choosing an approach.
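
As a sketch of that aggregation idea (the device names and the first-fit allocator here are hypothetical, not any particular product's implementation):

    # Illustrative only: assembling a 50GB logical device from unused
    # portions of several physical devices. Sizes are in GB.
    free_space = {"disk-a": 10, "disk-b": 15, "disk-c": 10, "disk-d": 15}

    def allocate(request_gb, pool):
        """Carve a logical device out of free space spread across the pool, first-fit."""
        segments, remaining = [], request_gb
        for device, free in pool.items():
            if remaining <= 0:
                break
            take = min(free, remaining)
            segments.append((device, take))
            remaining -= take
        if remaining > 0:
            raise RuntimeError("pool cannot satisfy the request")
        return segments

    print(allocate(50, free_space))
    # [('disk-a', 10), ('disk-b', 15), ('disk-c', 10), ('disk-d', 15)]

The resulting device is larger than any single contributor, but it now depends on every contributor being reachable at once.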

When complex threats to continuity manifest as multiple concurrent component failures, the choice doesn't have to be between downtime and reduced operational capacity, provided the systems design accounts for what might be called logical storage geometry (at least, that's the term I'll use in this article, lacking a better one). An approach to redundancy that considers logical storage geometry can provide availability that is minimally impacted by such complex threats. The Anvil!'s architecture deliberately refrains from aggregate storage, accepting that tradeoff to preserve capacity in the face of multi-layer failures. This keeps a workload's IO as local, complete, and consolidated as possible, and it sidesteps the problem of post-partition loss of aggregate capacity entirely, because no capacity is aggregated in the first place.
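
A toy model of that point, using hypothetical node and device names: after a partition, a logical device that spans hosts on both sides of the split degrades, while one kept whole on a surviving node does not.

    # Sketch of 'logical storage geometry' under a cluster partition.
    logical_devices = {
        "aggregated-lv": ["node1", "node2"],   # space pooled across hosts
        "node-local-lv": ["node1"],            # complete replica on one host
    }

    reachable = {"node1"}  # hosts still visible after the partition

    for name, hosts in logical_devices.items():
        ok = all(h in reachable for h in hosts)
        print(f"{name}: {'fully available' if ok else 'degraded/unavailable'}")
    # aggregated-lv: degraded/unavailable
    # node-local-lv: fully available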

Why Not Both?

Fortunately, if the operational cycles and pain points within an organization's technology strategy are appropriately matched to the architecture it chooses, business-process friction at the technical level can be managed and meaningfully reduced. A virtual machine, like a 'pod' in other clustered server architectures, provides a useful base unit when accounting for solutions to these pain points. One familiar benefit of this base unit is that the Anvil!'s running VMs can be migrated from one node to another, manually or automatically, a process that will feel familiar to vMotion operators. For high availability and fault tolerance in particular, the VM becomes a fundamental component of any redundancy strategy. Knowing the bounds of the smallest viable unit of business technology allows for the preparation of unitary spare capacity, or crumple zones: areas of acceptable resource loss between atomic units of availability, which ensure that the minimum unit of recovery corresponds as tightly as possible to the standard unit of capacity. The sketch below illustrates the idea.
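
Here is a hypothetical back-of-the-envelope check of the crumple-zone idea for a two-node pair; the VM names and figures are illustrative, not drawn from any real Anvil! deployment:

    # Can the surviving node absorb every VM if its peer fails?
    # Units are GB of RAM; a real check would also consider CPU and storage.
    node_ram = 128
    vms = {"db": 16, "app": 8, "web": 4, "batch": 12}

    def survives_peer_loss(node_ram_gb, vm_ram):
        """True if one node can host all VMs after the other node is lost."""
        return sum(vm_ram.values()) <= node_ram_gb

    print(survives_peer_loss(node_ram, vms))  # True: 40GB fits in 128GB

Keeping that check true at all times is what reserves the crumple zone: the spare capacity is deliberately left on the table so that recovery of a whole node's worth of workloads is always possible.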

While this approach does leave resources on the table, what it takes in exchange is reliability: it prioritizes both logical and physical redundancy, achieved through congruent storage geometry within a node. This means threats to availability are always resolved atomically at each layer. If a threat to a VM represents a threat to 100% of capacity, a proper recovery, as far as the Anvil! is concerned, is a recovery of 100% of capacity.

A Strategy for Vigilance

The tendency for failures to arrive concurrently is a familiar reality for all operations teams, especially those whose infrastructure is more modest. In the face of the greater challenges to uptime that, somewhat ironically, are awarded to applications for having achieved high uptime in the first place, preparing for what once seemed unlikely becomes essential. The Anvil! takes this preparatory work as a first principle, ensuring that single-component failures leave less tenuous circumstances in the interim and empowering further exploration into the far reaches of business continuity.