The Complex Nature of Downtime
Design decisions for services and applications are tricky problems in their own right, which may explain why host-based solutions occupy the place they do in today's market. Any technology’s role in business continuity is to take a given implementation and operationalize it with resilience as a first-class concern; once validated to have reliably solved the problem at hand, the implementation is entrusted with the “real deal” data, contingency procedures are made official, and monitoring alerts go live. The unique position host-based solutions occupy during these endeavours yields some interesting insights, and the Anvil! is certainly no exception.
Given enough time, an organization gets its first true test of its approach to this problem, and the “Are we back online?” question gets a longer-than-short answer, revealing potential areas for improvement. When an outage is resolved, businesses look for more than just a responsive port, a successful connection, or even an active transmission. They look for their newest content, their latest orders, their most recent fulfillment data – they look for business process vital signs. Once an organization determines why it needs not just highly available systems but highly available storage as well, the question of how (and perhaps how much) becomes relevant.
Escaping ‘Ops Groundhog Day’
How do you measure the vital signs of your business operations with technology? Perhaps the more immediate question is, “What sure signs of business process vitality are available to technology?” The challenge is the same for services and applications as it is for hosts – even metrics that measure technological aspects of commerce, like click-through rates, can’t on their own tell you the nature of the content served by the click action: whether it’s no longer accurate, no longer performant, corrupt, or otherwise compromised. One compound strategy that solves for both connection and content, and that works well for network resources, is the combination of a network transport client monitor (Port at IP) with an application client monitor (Content in HTTP response). This approach has the added benefit of confirming that your organization’s application is sound and its content is valid, giving other metrics useful context during a recovery scenario. For local applications, a similar strategy might pair a block storage client monitor with a filesystem content monitor, ensuring that a disk is not only available but also presents the expected content at the expected path.
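To make the idea concrete, here is a minimal sketch of such a compound monitor in Python. The hostnames, URL, paths, and expected strings are hypothetical placeholders; the point is only the pairing of a transport-level check with a content-level check, for both a network resource and a local one.

```python
#!/usr/bin/env python3
"""Minimal sketch of a compound connection-and-content monitor.

All hostnames, ports, paths, and expected strings below are hypothetical
placeholders, not values taken from the article.
"""
import socket
import urllib.request
from pathlib import Path


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Network transport client monitor: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def content_is_valid(url: str, expected: str, timeout: float = 5.0) -> bool:
    """Application client monitor: does the HTTP response body carry the
    content we expect (e.g. a marker string from the latest deploy)?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected in body
    except OSError:
        return False


def disk_presents_content(path: str, expected: str) -> bool:
    """Local analogue: the storage is mounted *and* the expected content
    is present at the expected path."""
    p = Path(path)
    return p.is_file() and expected in p.read_text(errors="replace")


if __name__ == "__main__":
    # Hypothetical targets -- adjust to your environment.
    checks = {
        "port":    port_is_open("app.example.com", 443),
        "content": content_is_valid("https://app.example.com/health", "build-1234"),
        "local":   disk_presents_content("/srv/app/releases/current/VERSION", "1.2.3"),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'FAIL'}")
```

Run on a schedule, the three results together answer both halves of the question: is something listening, and is it serving the content the business actually depends on.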
Considering Various Strategies
Many organizations choose the approach of operating parallel and continuous deployment pipelines, perhaps spurred by prior load-balancing efforts that already yielded the requisite infrastructure. It’s common for databases to be configured for replication in a primary-secondary arrangement for persistent data, and for CI/CD pipelines to handle building the runtime environment, typically some sort of image. In essence: redundant software runtimes for connection availability, and redundant databases for content availability. As long as the investment in procedure development, documentation, and personnel training is sufficient to prevent configuration drift, it’s a solid strategy.
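The content-availability half of that strategy is only as good as the replication behind it, so it is worth checking continuously. Below is a small sketch of such a check, assuming (purely as an illustration) a PostgreSQL 10+ primary-secondary pair and the psycopg2 driver; the connection string and lag threshold are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: verify the 'content availability' half of the parallel-pipeline
strategy by asking the primary how far behind each replica is.

Assumes a PostgreSQL 10+ primary-secondary pair and the psycopg2 driver;
the DSN and threshold below are hypothetical placeholders.
"""
import psycopg2

PRIMARY_DSN = "host=db-primary.example.com dbname=app user=monitor"
MAX_LAG_BYTES = 16 * 1024 * 1024  # flag a replica this far behind in WAL bytes


def replication_status(dsn: str) -> list:
    """Return (replica address, state, lag in bytes) for each attached replica."""
    query = """
        SELECT client_addr::text,
               state,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
          FROM pg_stat_replication;
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            rows = cur.fetchall()
    return [(str(addr), state, int(lag) if lag is not None else None)
            for addr, state, lag in rows]


if __name__ == "__main__":
    for addr, state, lag in replication_status(PRIMARY_DSN):
        healthy = state == "streaming" and lag is not None and lag < MAX_LAG_BYTES
        print(f"replica {addr}: state={state} lag={lag} -> {'OK' if healthy else 'CHECK'}")
```

A replica that is attached but lagging is exactly the kind of drift that procedure and training are meant to catch; automating the check keeps the strategy honest between drills.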
Other strategies seek more operational economy for the fault-tolerant capabilities that suit their availability requirements. Redundant cluster block storage makes an excellent choice in this regard, as it encapsulates practically any manner of data storage service an application might require. This eliminates the need for a redundancy implementation tailored to each such service – databases, caching, distributed logs – to achieve fault tolerance. What’s more, since the interface for fault tolerance is at the hypervisor level, deployment pipeline work can be isolated while uptime protection is maintained.
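Because the replication happens beneath the guest, verifying its health is a single check regardless of which services run on top. The sketch below assumes DRBD as the replicated block device (one common implementation of redundant cluster block storage) and a resource named r0; both are illustrative choices, and the exact `drbdadm status` output format varies by DRBD version.

```python
#!/usr/bin/env python3
"""Sketch: confirm replicated cluster block storage is healthy before
relying on it in a recovery.  Assumes DRBD and a resource named 'r0';
both are illustrative, and output format varies by DRBD version.
"""
import subprocess

RESOURCE = "r0"  # hypothetical DRBD resource name
BAD_STATES = ("Inconsistent", "Outdated", "Diskless", "StandAlone", "Connecting")


def drbd_resource_healthy(resource: str) -> bool:
    """True if the resource reports UpToDate disks and no degraded or
    disconnected state, judged from `drbdadm status` output."""
    result = subprocess.run(
        ["drbdadm", "status", resource],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    out = result.stdout
    return "UpToDate" in out and not any(state in out for state in BAD_STATES)


if __name__ == "__main__":
    ok = drbd_resource_healthy(RESOURCE)
    print(f"resource {RESOURCE}: {'healthy' if ok else 'degraded or unreachable'}")
```

One check at the block layer stands in for a stack of per-service replication checks, which is precisely the operational economy this strategy is after.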
A Production-Proofed Approach
Fault-tolerant HA with hypervisors remains a key component in a comprehensive approach to uptime. More than just an active process or a live connection, the presence – and certainly the nature – of specific data when resolving an availability incident eases some of the worst pain points operations teams experience during recovery. A high availability strategy that properly accounts for storage makes the difference between the availability of a customer resource and the availability of your organization’s customer resource. All the same, that’s only a necessary component, not a sufficient one on its own. In the next article, we’ll discuss the reasons modern HA storage solutions have taken uptime protection even further.