Resiliency architecture and testing, part 1: AWS Well-Architected and the reliability pillar

 

by Stephen Welsh

Chase is planning a significant migration of its workloads to Amazon Web Services (AWS) over the coming years.

This gives us the opportunity not only to take advantage of modern cloud infrastructure, but also to design well-architected systems as we modernize our applications. To succeed on AWS and take full advantage of the processes for delivering software to it, we will have to align our core concepts and definitions with industry best practices. In this three-part series, I’ll focus on resiliency as it relates to availability and on testing availability through chaos experiments, then discuss how to establish availability requirements and add them to the deployment module in SEAL.

AWS Well-Architected and the Reliability Pillar

AWS Well-Architected helps cloud architects build secure, high-performing, resilient and efficient infrastructure for a variety of applications and workloads. Built around six pillars — operational excellence, security, reliability, performance efficiency, cost optimization and sustainability — AWS Well-Architected provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs.

The reliability pillar sets out five design principles to guide architects, engineers and site reliability engineers (SREs) in building systems that meet reliability targets agreed with their business partners.

  • Automatic recovery from failure: the failed component could be an application, an EC2 instance, an Availability Zone (AZ) or an Amazon Relational Database Service (RDS) database
  • Test recovery procedures: use automated chaos testing to degrade or fail the workload and validate that the recovery procedures work
  • Horizontal scale: spread the workload across multiple smaller resources or services, rather than one large one, to reduce the impact of a single failure
  • Manage capacity: monitor demand and workload utilization to provision instances appropriately
  • Manage automation change: changes to the automation that manages the infrastructure also need to be tracked, reviewed and stored in a code repository
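To make the "test recovery procedures" principle concrete, here is a minimal, self-contained sketch in Python. It does not call any AWS API: `Fleet`, `Instance`, `reconcile` and `run_chaos_experiment` are hypothetical names invented for illustration, and the in-memory fleet simply stands in for something like an auto-scaled group of instances. The idea is to inject a failure on purpose, observe the degraded state, run the recovery procedure and check that capacity returns to the desired level.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a compute instance."""
    name: str
    healthy: bool = True


@dataclass
class Fleet:
    """Toy fleet that tries to keep `desired` healthy instances running."""
    desired: int
    instances: list = field(default_factory=list)
    launched: int = 0

    def __post_init__(self) -> None:
        # Bring the fleet up to the desired capacity at creation time.
        self.reconcile()

    def reconcile(self) -> None:
        """Recovery procedure: drop unhealthy instances and launch replacements."""
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired:
            self.launched += 1
            self.instances.append(Instance(name=f"i-{self.launched}"))

    def healthy_count(self) -> int:
        return sum(1 for i in self.instances if i.healthy)


def run_chaos_experiment(fleet: Fleet, rng: random.Random) -> tuple:
    """Fail one random instance, then validate that recovery restores capacity.

    Returns (healthy count while degraded, healthy count after recovery).
    """
    victim = rng.choice(fleet.instances)
    victim.healthy = False            # inject the failure
    degraded = fleet.healthy_count()  # observe the impact
    fleet.reconcile()                 # run the recovery procedure
    return degraded, fleet.healthy_count()


if __name__ == "__main__":
    fleet = Fleet(desired=3)
    degraded, recovered = run_chaos_experiment(fleet, random.Random(0))
    print(f"healthy while degraded: {degraded}, after recovery: {recovered}")
```

In a real experiment the failure would be injected into actual infrastructure and the recovery would be automatic rather than called by the test; the sketch only shows the shape of the check: fail, observe, recover, verify.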

Reliability, in turn, depends on three related concepts:

  • Resiliency: The ability of a workload to recover from infrastructure or service disruptions and to dynamically acquire computing resources to meet demand
  • Availability: The percentage of time that a workload is available for use
  • Disaster Recovery (DR): The ability to recover a workload after one-time events such as natural disasters, large technical failures or attacks; a key measure is the recovery time objective (RTO)
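Because availability is defined as a percentage of time, it translates directly into a downtime budget. The short Python sketch below shows the arithmetic; the function names and the 365-day window are assumptions chosen for illustration, not part of any AWS tool or Chase standard.

```python
def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability = uptime / (uptime + downtime), expressed as a percentage."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("the observation window must be longer than zero")
    return 100.0 * uptime_seconds / total


def allowed_downtime_minutes(target_percent: float, window_days: float = 365.0) -> float:
    """Downtime budget, in minutes, implied by an availability target over a window."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # Illustrative targets only; a real target comes from the business requirement.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

For example, a 99.9% target over a 365-day year leaves a budget of about 525.6 minutes, or a little under nine hours, of downtime.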

In Part 2, we’ll explore resiliency within Chase. Stay tuned!