Achieving Data Autonomy, Part 2

January 1 — 11 min read

Our plan was to replace the heritage extract, transform, load (ETL) tooling with a single, cloud-native solution that solved the key challenges we identified. The data technology leadership, along with our architecture and engineering leadership, said “Great. Prove it.”

We had to prove that:

Spark Java could handle complex transformations and integrations at scale, as fast as or faster than our vendor tools
Spark Java was a better fit than the other options available through the Spark API
600 ETL developers using point-and-click tools could be reskilled to adopt the solution

Challenge accepted!

We had a very small development team of three people, plus an ETL developer with no Java experience from our centralized data team (soon to become one of our primary clients), to work on the proof of concept (POC) and subsequent pilot. We wanted a client embedded in our build team to ensure we never left our existing developers behind. We had to build a framework that even the most junior software engineers could adopt easily, without compromising our architecture and engineering standards.

We began building the early software development kit (SDK) for the framework using a data flow from our Home Lending line of business. We felt this flow was representative of the complexity and size of our flows, which made it a good initial set of requirements. The team would build out pieces of the SDK, then hand them over to the ETL developer to see whether they intuitively understood how to use the framework (they had previously taken an online Java fundamentals course through our learning center). Although the team provided guidance where needed, our client was able to use the framework to build a flow comparable to the original built in the vendor tool. The new flow also had fully automated test coverage! Through this partnership, the team received immediate feedback from the client on both functionality and ease of use.

After five months, the initial pilot was built. The sample flow could process about three times as much data in half the time of the original vendor-built solution. Our leaders were thrilled!

Full Productization

As we left the pilot phase and embarked on our modernization journey at scale, it was important to think about true product architecture and boundaries in the context of our Data & Analytics products.

Our existing heritage technology stacks went beyond data movement into metadata, business data registration, access control, storage platform onboarding and access, archiving and purging (data retention) and more. We all had to agree that we were not “replacing” heritage functionality with a new thing. We were all – together – fundamentally modernizing how we manage data at the bank. Chase is organized into a product structure, so within the Data & Analytics group, we needed to ensure functionality was slotted into the right product architecture. This ensured shared functionality existed only once, while also reducing churn within the teams caused by out-of-scope feature requests. Product owners and technology partners know where to redirect a feature request when it falls outside their product boundaries and architecture.

For our data pipeline, we focused only on data movement and transformation and the metadata required to operate a pipeline. We worked with our peer products to review existing heritage functionality and ensure that anything required in the target state was on the appropriate product roadmap. Missing features still occasionally come up for discussion, but because we have a well-understood product architecture, those discussions are significantly easier than they once were!

As we populated the product backlog with the intended functionality, we worked with our biggest client to identify a minimum viable product (MVP) pilot for end-to-end functionality. We selected an application that was representative of the overall landscape for data lake use cases – our initial focus – and documented which items in the overall feature backlog were required to fulfill the MVP.

We spent approximately six months building the MVP and delivered it to production in May 2020. The MVP needed to be able to perform all of the following functions:

Collect the technical data set metadata required to describe the data sets being used in the pipeline
Ingest ASCII fixed-width data to the data lake
Use the previously constructed SDK to perform transformation
Provision data to a consumption platform
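To make the first two functions concrete, here is a minimal sketch of how technical metadata can drive the parsing of an ASCII fixed-width record. It is not the SDK described above: `FixedWidthSketch`, `FieldSpec`, the column names and the sample record are all hypothetical, and plain Java is used in place of Spark so the example runs without dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not the SDK from the article.
final class FixedWidthSketch {

    // Technical metadata for one column: name, zero-based byte offset, width in bytes.
    static final class FieldSpec {
        final String name;
        final int start;
        final int width;

        FieldSpec(String name, int start, int width) {
            this.name = name;
            this.start = start;
            this.width = width;
        }
    }

    // Slice one ASCII line into named, trimmed values according to the layout.
    static Map<String, String> parse(byte[] line, List<FieldSpec> layout) {
        Map<String, String> row = new LinkedHashMap<>();
        for (FieldSpec f : layout) {
            if (f.start + f.width > line.length) {
                throw new IllegalArgumentException("Line too short for field " + f.name);
            }
            String raw = new String(line, f.start, f.width, StandardCharsets.US_ASCII);
            row.put(f.name, raw.trim());
        }
        return row;
    }

    public static void main(String[] args) {
        // Assumed layout: 8-byte loan id, 2-byte state code, 10-byte balance.
        List<FieldSpec> layout = List.of(
            new FieldSpec("loanId", 0, 8),
            new FieldSpec("state", 8, 2),
            new FieldSpec("balance", 10, 10));
        byte[] line = "00012345OH0000150000".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parse(line, layout));
        // prints {loanId=00012345, state=OH, balance=0000150000}
    }
}
```

Keeping the layout as data rather than code is the point of the sketch: the same parser can serve any data set whose metadata has been collected.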

This resulted in a series of application programming interfaces (APIs) to describe the data sets and pipelines (by type), along with the functionality to execute those pipelines, apply data governance and controls to the process being performed, and produce evidence.

Over the next two years, we continued to enhance the functionality based on direct client input – we never build a feature without a client use case. We have also evolved the data pipeline to manage use cases beyond the data lake, becoming the solution for data movement across the bank for everything from batch and micro-batch to streaming, across our on-premises cloud environment and off-premises public cloud. While our goal is for all data publishers to publish events in real time, our pipeline solution has to accommodate the transition, so we built functionality for things like Extended Binary Coded Decimal Interchange Code (EBCDIC) data and extremely large files.
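As an illustration of what handling EBCDIC input can involve, the sketch below decodes EBCDIC bytes to text and reads fixed-length records from a stream one at a time, so a large file never has to fit in memory. It rests on assumptions that do not come from the article: the code page is taken to be IBM037 (US/Canada EBCDIC), records are assumed to be fixed length with no delimiters, and `EbcdicSketch` and its methods are hypothetical names. The IBM037 charset ships with standard JDK distributions but may be missing from trimmed-down runtimes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.function.Consumer;

// Illustrative only: the class name, code page and record layout are assumptions.
final class EbcdicSketch {

    // Assumed code page; the real one depends on the publishing system.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // Decode a block of EBCDIC bytes into a Java string.
    static String decode(byte[] bytes) {
        return new String(bytes, EBCDIC);
    }

    // Stream fixed-length records to a sink without loading the whole input.
    static long decodeRecords(InputStream in, int recordLength, Consumer<String> sink)
            throws IOException {
        byte[] buf = new byte[recordLength];
        long count = 0;
        int n;
        while ((n = in.readNBytes(buf, 0, recordLength)) == recordLength) {
            sink.accept(new String(buf, EBCDIC));
            count++;
        }
        if (n != 0) {
            throw new IOException("Trailing partial record of " + n + " bytes");
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "HELLO 12" encoded in IBM037.
        byte[] raw = {
            (byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3,
            (byte) 0xD6, (byte) 0x40, (byte) 0xF1, (byte) 0xF2};
        System.out.println(decode(raw)); // prints HELLO 12

        // The same bytes read as two 4-byte records.
        long records = decodeRecords(new ByteArrayInputStream(raw), 4, System.out::println);
        System.out.println(records + " records");
    }
}
```

Real mainframe extracts often mix text with binary or packed numeric fields, which need per-field handling instead of a whole-record character decode; that is outside the scope of this sketch.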

To avoid becoming the bottleneck to modernization for the entire bank, we also had to champion inner source development. We have built a robust inner source model that has been used as a showcase for other teams to adopt. We built our product with this model in mind from the first day of development, ensuring our source code repository was open for reading to everyone in the firm.

Additionally, we championed pride of ownership in our high-quality engineering practices. Losing our clients’ trust is the worst possible outcome for our team. Because of this, we focus on quality not only in how we write our code and tests, but also in how we write our backlog. Our product office works directly with clients to develop robust features in our backlog, with clear acceptance criteria written in Gherkin. These criteria are then turned into acceptance tests as the first step when a team accepts a feature into their backlog for delivery. This has made us confident enough in our code that we push to production every day on average.
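As a rough illustration of that workflow, the sketch below shows a Gherkin-style acceptance criterion and a test derived from it. The scenario, the `validate` function and the rejection reasons are all hypothetical, invented for this example. The article does not say which test tooling is used, so the test is written in plain Java with Given/When/Then comments; a team might instead bind such steps through a framework like Cucumber.

```java
import java.util.Optional;

// Illustrative only: the scenario and the code under test are hypothetical.
final class AcceptanceSketch {

    /*
     * Acceptance criterion as it might appear in the backlog:
     *
     *   Scenario: Reject a record shorter than the declared layout
     *     Given a data set whose layout is 10 bytes wide
     *     When a record of 8 bytes is ingested
     *     Then the record is rejected with reason "record too short"
     */

    // Hypothetical unit under test: empty when accepted, else a rejection reason.
    static Optional<String> validate(byte[] line, int layoutWidth) {
        if (line.length < layoutWidth) {
            return Optional.of("record too short");
        }
        if (line.length > layoutWidth) {
            return Optional.of("record too long");
        }
        return Optional.empty();
    }

    // Acceptance test that mirrors the scenario step by step.
    static void rejectsShortRecord() {
        // Given a data set whose layout is 10 bytes wide
        int layoutWidth = 10;
        // When a record of 8 bytes is ingested
        Optional<String> reason = validate(new byte[8], layoutWidth);
        // Then the record is rejected with reason "record too short"
        if (!reason.equals(Optional.of("record too short"))) {
            throw new AssertionError("Unexpected result: " + reason);
        }
    }

    public static void main(String[] args) {
        rejectsShortRecord();
        System.out.println("acceptance test passed");
    }
}
```

Writing the test before the feature is built means it fails first and passes only once the agreed behavior exists, which turns the acceptance criterion into an executable definition of done.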
