Resiliency architecture and testing, part 2: Availability

 

by Stephen Welsh

4 min read

Chase is planning for a significant workload migration of its systems to Amazon Web Services (AWS) over the coming years.

We have the opportunity not only to take advantage of modern cloud infrastructure, but also to design well-architected systems as we modernize our applications. To succeed on AWS and take full advantage of the processes for delivering software to it, we will have to align our core concepts and definitions with industry best practices. In this three-part series, I’ll focus on the availability aspect of resiliency and on testing availability through chaos experiments, then discuss how to establish availability requirements and add them to the deployment module in SEAL.

Availability with a Foundation in SEAL

Currently in SEAL (an internal tool), the recovery time objective (RTO) is defined at the application level and is assumed to apply to every deployment module of the application. It is used today for disaster recovery (DR), site reliability (SR) and availability tests, and it is part of our reliability control procedures. However, RTO and availability solve two completely different problems: RTO bounds how long recovery may take once a disruption has occurred, while availability describes the proportion of time a service must remain usable.

For instance, suppose an application module has a five 9s availability target (99.999%, or just about 100%) and an RTO of two hours. During an availability zone (AZ) outage, the module can still honor its five 9s because it is deployed to multiple AZs. There is no need to invoke the RTO, even if the AZ takes days to recover, because the module remains highly available. By contrast, if the application has a batch module that is only two 9s (99%), it may be acceptable to run it in a single AZ, since that module is allotted more than three days of downtime per year.
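
The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration, not part of SEAL: the function names are hypothetical, a year is taken as 365 days, and the multi-AZ calculation assumes AZ failures are independent, which is a simplification.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def multi_az_availability(single_az_percent: float, az_count: int) -> float:
    """Availability (percent) of a module that is up while at least one AZ is up.

    Assumption: AZ failures are independent, so the module is down only
    when every AZ is down at the same time.
    """
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    all_down = (1 - single_az_percent / 100) ** az_count
    return (1 - all_down) * 100


if __name__ == "__main__":
    for label, target in [("two 9s", 99.0), ("five 9s", 99.999)]:
        minutes = downtime_budget_minutes(target)
        print(f"{label} ({target}%): {minutes:.2f} min/year "
              f"({minutes / 1440:.2f} days/year)")
```

Two 9s works out to roughly 3.65 days of permitted downtime per year, while five 9s leaves only about 5.26 minutes, which is why the batch module can tolerate a single AZ and the five 9s module cannot.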

Why Configure Availability in SEAL?

Outside of the RTO discussion, we do have some success designing for high availability and then testing, observing and validating it without a true availability requirement in a system of record (SOR). However, we should also acknowledge that high-availability (HA) designs are inconsistent and can be over-architected or under-architected. Tests from the development and operations teams are inconsistent between apps and environments, and they are usually run manually and planned within maintenance windows. Observed test progress comes in the form of reports, which means results may not be enforced, and the evidence used to validate design and implementation may not match and can be difficult to measure consistently.

Here’s what we can gain by having availability expectation requirements in SEAL:

  • Granular service agreements with the business that include cost and recoverability expectations for those services
  • Documentation for all types of measurements and enforcement of resiliency practices
  • Governance of design and testing patterns
  • Signal generation of breach of patterns tied to a controlled procedure
  • Ownership of patterns — product architects for design and SREs for chaos testing
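
As a rough sketch of what a per-module requirement and a breach signal could look like, consider the following Python. Everything here is hypothetical: the class names, fields and module names are illustrative assumptions and do not describe SEAL's actual schema or any existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityRequirement:
    """Hypothetical per-module requirement; not SEAL's real schema."""
    module: str
    target_percent: float  # e.g. 99.999 for five 9s
    min_az_count: int      # design pattern: minimum number of AZs


@dataclass(frozen=True)
class Observation:
    """Hypothetical measurement for one module over a reporting period."""
    module: str
    observed_percent: float
    deployed_az_count: int


def find_breaches(req: AvailabilityRequirement, obs: Observation) -> list[str]:
    """Return breach signals for one module; an empty list means compliant."""
    if req.module != obs.module:
        raise ValueError("requirement and observation refer to different modules")
    breaches = []
    # Design-pattern check: is the module deployed across enough AZs?
    if obs.deployed_az_count < req.min_az_count:
        breaches.append(
            f"{req.module}: deployed to {obs.deployed_az_count} AZ(s), "
            f"pattern requires at least {req.min_az_count}"
        )
    # Measurement check: did observed availability meet the agreed target?
    if obs.observed_percent < req.target_percent:
        breaches.append(
            f"{req.module}: observed {obs.observed_percent}% "
            f"is below target {req.target_percent}%"
        )
    return breaches


if __name__ == "__main__":
    requirement = AvailabilityRequirement("payments-api", 99.999, 2)
    observation = Observation("payments-api", 99.9, 1)
    for signal in find_breaches(requirement, observation):
        print(signal)
```

Keeping the design check and the measurement check separate mirrors the split in ownership above: one signal points at the architecture pattern, the other at what testing and observation actually found.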

And here’s how establishing the resiliency requirements of a deployable module flows:

There is an upfront cost to setting availability requirements in SEAL, but that investment will establish agreements that create long-term savings, beyond the implementation of SEAL and the updates required to Photon, PtX, AaC, Chaos Test Patterns and JET. Our application teams and business partners will need to be more granular in their agreements for each service, and there may be testing overhead that is not in place today, which teams will need to follow and the business will need to understand. However, in the long run, the design and test patterns will be consistent, tests will be more automated, audits will be easier and reporting will be much more accurate.

In Part 3, we’ll explain how we established the Permit to X process for verifying and validating a minimum set of requirements before a new application component, module or service can take on customer traffic.