In early 2021, Azure CTO Mark Russinovich’s regular Ignite walkthrough of the Azure architecture gave us a first look at Chaos Studio, the platform’s fault injection tool. Building on the Chaos Monkey concept introduced by Netflix, the growing discipline of chaos engineering aims to help developers understand what happens to cloud-scale applications when they fail.
As the second Ignite of 2021 opens its digital doors, Microsoft is unveiling the first public preview of Chaos Studio as part of its drive to deliver better and more resilient cloud applications. I had the opportunity to speak with Mark Russinovich prior to the preview launch about Azure’s approach to chaos engineering and how he sees developers benefiting from these technologies.
Adding chaos to Azure
Chaos engineering in Azure is not new. As Russinovich says, “We’ve been doing chaos engineering in Azure since the beginning. There has been a lot of home-grown chaos.” But as the service grew, what started as tooling unique to specific teams had to become something that works for everyone who builds on and in Azure. He says, “Over the past few years, we’ve realized, ‘Hey, we need to consolidate these chaos engineering efforts into a common resource, a common framework service that we can apply to all of our services.’ ”
That common tool was the foundation for Chaos Studio, and although it started life as an in-house tool, Russinovich points out that the goal was always to become customer-centric. What customers need may not be what Microsoft needs, but the lessons they learn can make Azure better for all of its users, both inside and outside Redmond. “We think we not only give customers the benefits of a service that works for them, but we can grow an ecosystem to have this with customers. The extensibility they provide produces fault injections that we can then use across the ecosystem and even internally,” he says.
Introducing Chaos Studio
Chaos Studio is a tool that allows developers and testers to script fault injections into running systems, starting with failing virtual machines and moving on to more detailed, lower-level faults, including CPU and memory stress. Faults are either agent-based, which require a Chaos Studio agent as part of a VM build (for both Windows and Linux), or service-direct. After the agent and any prerequisites are installed, you can use Chaos Studio to choose the type of test you want to run and how you want to run it. For example, if you are stress testing the CPU, you first define how long you want to apply CPU pressure and how much pressure you want to add.
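To make those two CPU pressure parameters concrete, here is a minimal Python sketch of how such a fault might be described and sanity-checked before it is run. The class name, field names, target name, and limits are all hypothetical, chosen for illustration; this is not the Chaos Studio schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuPressureFault:
    """Hypothetical description of a CPU stress fault (not the Chaos Studio schema)."""
    duration_minutes: int   # how long to apply pressure
    pressure_percent: int   # how much CPU load to add, 1-100

    def __post_init__(self):
        # Reject values that would make the experiment meaningless.
        if self.duration_minutes <= 0:
            raise ValueError("duration_minutes must be positive")
        if not 1 <= self.pressure_percent <= 100:
            raise ValueError("pressure_percent must be between 1 and 100")


def describe(fault: CpuPressureFault, target: str) -> str:
    """Return a human-readable summary of what the experiment would do."""
    return (f"Apply {fault.pressure_percent}% CPU pressure to {target} "
            f"for {fault.duration_minutes} minutes")


# "vm-web-01" is a made-up VM name used only for this example.
fault = CpuPressureFault(duration_minutes=10, pressure_percent=90)
print(describe(fault, target="vm-web-01"))
```

Validating the parameters up front is a design choice for the sketch: a bad duration or load value is cheaper to catch before a fault is injected than after.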
When you perform a stress test like this, you need tools such as Azure Monitor in addition to Chaos Studio to give you insight into what is happening with your systems. The same applies to service-direct failures. These are used to affect Azure resources, such as Cosmos DB, once you have associated a service with your Chaos Studio instance. Here you can set up a test to see how your application responds to, for example, a cross-region failover of a key service.
One of the most important aspects of a tool like Chaos Studio is its focus on an experimental approach to testing. This is essential when dealing with large-scale distributed systems where the underlying system health is unknown. Chaos Studio allows you to validate assumptions about application behavior. For example, you might want to build a test that validates what happens if an Azure zone goes down or you lose a server hosting a set of virtual machines.
Chaos as Science: Using Experiments
The essence of chaos engineering is building a hypothesis and then testing it to uncover the edge cases that can cause problems for your users. As Russinovich says, this part of building an observable, manageable distributed system becomes “really a platform to validate the behavior of the system, and it just doesn’t work without observability on the other side. If you can’t see what the test is doing, the test is useless. So it also tests your observability because you’d say, ‘hey, if it loses a few VMs or goes over x threshold, it should go off with a warning.’ Well, if that warning doesn’t go off, it’s because your observation systems aren’t tuned to catch those things you want to catch.”
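The point about an experiment testing your observability as well as your application can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Hypothesis` class, the `evaluate` function, the alert name, and the choice to treat the threshold as inclusive. None of it comes from Chaos Studio or Azure Monitor.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical expectation: losing this many VMs should raise this alert."""
    vms_lost_threshold: int
    expected_alert: str


def evaluate(hypothesis: Hypothesis, vms_lost: int, alerts_fired: set) -> str:
    """Compare what the fault did with what the monitoring side reported."""
    threshold_crossed = vms_lost >= hypothesis.vms_lost_threshold
    alert_seen = hypothesis.expected_alert in alerts_fired
    if threshold_crossed and alert_seen:
        return "validated"
    if threshold_crossed:
        return "observability gap"   # the fault happened but no warning fired
    if alert_seen:
        return "false alarm"         # the warning fired below the threshold
    return "not exercised"           # the fault was too small to test anything


# Three VMs were lost, but the (made-up) alert never fired.
h = Hypothesis(vms_lost_threshold=3, expected_alert="vm-capacity-low")
print(evaluate(h, vms_lost=3, alerts_fired=set()))
```

In this run the result is an observability gap: the fault crossed the threshold and the expected warning stayed silent, which is the case Russinovich describes.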
An experiment-oriented approach turns chaos engineering into a tool for continuously validating your applications. Chaos engineering may sound arbitrary, but it isn’t. You take an engineering-led approach to disrupting a complex system, with the intent of understanding the effects that disruption has on the system as a whole. Have you designed a shopping cart system that switches to a new instance if the e-commerce system crashes, or does a customer lose all their purchases and have to start over? You have an assumption about how your application works. With Chaos Studio you can test everyday actions and, at the same time, investigate what happens in more challenging environments.
These are what Russinovich calls “game day” events, using Chaos Studio to experiment with what-if scenarios. He describes how customers used the service in the preview: “Let’s just say [they have] an e-commerce application, which is distributed globally for high availability and resiliency, and an Azure region becomes inaccessible and the application in that region fails. How does the system behave? That’s kind of a game-day experiment that they’re going to run.”
With this type of use, you can build Chaos Studio experiments into your CI/CD pipeline, using them alongside load generators to prepare and test deployments before putting code into production. Here it becomes a means of validating deployments and their associated virtual infrastructures before releasing updates to the public. By using Azure private VNets to host your canary builds, you can quickly deploy, test, and tear down an instance, minimizing costs.
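One way to picture an experiment as a pipeline step is a gate that runs a fault against a canary, compares the resulting metrics with agreed limits, and returns an exit code the pipeline can act on. The Python below is a sketch under stated assumptions: `chaos_gate`, the metric names, and the limits are invented for this example, and `fake_experiment` is a stand-in that returns canned numbers instead of calling any real service.

```python
from typing import Callable, Dict


def chaos_gate(run_experiment: Callable[[], Dict[str, float]],
               max_error_rate: float = 0.01,
               max_p95_latency_ms: float = 500.0) -> int:
    """Run an experiment against a canary and return a process exit code.

    run_experiment is a placeholder for whatever starts the fault and the
    load generator and then collects metrics; it is not a real API.
    """
    metrics = run_experiment()
    failures = []
    if metrics["error_rate"] > max_error_rate:
        failures.append(f"error rate {metrics['error_rate']:.3f} exceeds {max_error_rate}")
    if metrics["p95_latency_ms"] > max_p95_latency_ms:
        failures.append(f"p95 latency {metrics['p95_latency_ms']:.0f} ms exceeds {max_p95_latency_ms:.0f} ms")
    for line in failures:
        print("FAIL:", line)
    # A non-zero exit code is the usual way to stop a pipeline stage.
    return 1 if failures else 0


def fake_experiment() -> Dict[str, float]:
    # Canned numbers so the sketch runs without any cloud resources.
    return {"error_rate": 0.004, "p95_latency_ms": 620.0}


exit_code = chaos_gate(fake_experiment)
print("exit code:", exit_code)
```

With the canned numbers the error rate is within its limit but latency is not, so the gate returns 1 and the deployment would be held back.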
Continuous validation: the foundation of resilient cloud applications
There is an interesting point to be made here about the role of continuous validation (CV) as the third leg of a tripod, alongside continuous integration and continuous delivery (CI/CD), as the foundation of distributed system devops. As engineers, our job is to build resilient applications in a non-deterministic environment. We build systems that run in dynamically self-scaling orchestrated networks of microservices, where services are shared between different applications and where concurrency and consistency make it difficult to determine what is causing a problem.
Russinovich is clearly excited about the capabilities of systems like this, noting that what comes with Chaos Studio’s public preview is just the start of something much bigger. “This is kind of the first step in a comprehensive system. It just gets more sophisticated over time.”
On one side of our applications are observation tools that allow us to deduce the status of an application from its many outputs. What Chaos Studio gives us, along with various testing frameworks, is a way to control more of the inputs to help us understand how changes in infrastructure and services affect our code. It’s clear from my conversation with Russinovich that Microsoft has plans to take Chaos Studio further and use it to test both services and infrastructure.
Since we think of cloud platform services as constituent infrastructure elements, this approach makes sense, as concepts from security testing, such as fuzzing, carry over into API testing. We need to be able to see what happens to a system when it receives incorrect input as much as we need to see what happens when an element fails. As Russinovich points out, if a system fails on Cyber Monday, it could have significant business ramifications. “[If it] is dropping and now I can’t process orders, that’s literally costing me millions of dollars an hour or tens of millions,” he says.
With so much business at stake, chaos engineering is becoming increasingly important for cloud architects. As systems become more complex, it is necessary to understand how they fail. Without that knowledge, we cannot build the resilient tools needed to support our businesses. By providing a common tool for injecting faults into our systems, Microsoft gives us much of what it takes to add continuous validation to our build pipelines and to our CI/CD processes. Maybe someday we’ll have CI/CD/CV, but for now we can start exploring what system faults really do to our code.
Copyright © 2021 IDG Communications, Inc.