by Stephen Welsh
Chase is planning for a significant workload migration of its systems to Amazon Web Services (AWS) over the coming years.
We have the opportunity not only to take advantage of modern cloud infrastructure, but also to design well-architected systems as we modernize our applications. To succeed on AWS and take full advantage of the processes for delivering software to it, we will have to align our core concepts and definitions with industry best practices. In this three-part series, I’ll focus on availability as a resiliency concept and on testing availability through chaos experiments, then discuss how to establish availability requirements and add them to the deployment module in SEAL.
Availability with a Foundation in SEAL
Currently in SEAL (an internal tool), the recovery time objective (RTO) is defined at the application level, and that RTO is assumed for every deployment module of the application. It is used today for disaster recovery (DR), site reliability (SR) and availability tests, and it is part of our reliability control procedures. However, RTO and availability solve two different problems: RTO bounds how long recovery may take after a disruption, while availability describes the share of time a module is expected to be operational.
For instance, suppose an application module has a five 9s availability target (99.999%, or very nearly 100%) and an RTO of two hours. During an availability zone (AZ) outage, the module can still honor its five 9s because it is deployed to multiple AZs. There is no need to invoke the RTO, even if the AZ takes days to recover, because the module remains highly available. If the same application has a batch module that is only two 9s, it may be acceptable to run that module in a single AZ, since it is allotted over three days of downtime per year.
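To make the arithmetic behind these targets concrete, here is a minimal sketch that converts a count of 9s into a yearly downtime allowance. The function names and the 365-day year are assumptions made for illustration; they are not part of SEAL or of any AWS API.

```python
# Illustrative sketch only: names below are hypothetical, not SEAL fields.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def availability_percent(nines: int) -> float:
    """Availability target as a percentage, e.g. 2 -> 99.0, 5 -> 99.999."""
    return 100.0 * (1 - 10 ** -nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime allowed per year for a target with the given number of 9s."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailable_fraction = 10 ** -nines  # 2 nines -> 0.01, 5 nines -> 0.00001
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for nines in (2, 3, 4, 5):
        minutes = downtime_minutes_per_year(nines)
        print(
            f"{nines} nines ({availability_percent(nines):.3f}%): "
            f"{minutes:.2f} minutes/year, or {minutes / (24 * 60):.2f} days/year"
        )
```

Under these assumptions, two 9s allows about 3.65 days of downtime per year, which matches the "over three days" figure for the batch module, while five 9s allows only a little over five minutes. A single-AZ deployment can plausibly fit the first budget but not the second.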