Chaos testing: Reliability for cloud-native apps

Reliability is a critical concern for software delivery teams. Every second of lackluster performance or service interruption comes with high costs. A 2023 report found that IT outages cost organizations up to $1 million per hour. The consequences often extend beyond monetary expenses and have a huge impact on a company’s reputation.

Though it may never be realistic to test for every possible circumstance, the need for reliable applications gives developers a strong incentive to test extensively throughout the development life cycle, both to squash bugs and to simulate failures.

As applications, especially cloud-native applications, become more interconnected and dependent on a diverse assortment of systems, predicting how a system will respond to failures has become even more complex. Cloud-native applications rely on many dependencies and microservices operating across multiple environments. To understand how these systems affect one another, developers have had to devise new strategies for testing software.

Chaos testing is one method for finding weak points before they cause problems. Creating controlled chaos and simulating realistic failures gives developers and site reliability engineers (SREs) a chance to figure out why and how outages might happen.

What is chaos testing?

Think of chaos testing as one component of a larger framework: chaos engineering. Chaos engineering creates hypotheses about how an application might perform under stressful conditions, then subjects the application to chaos tests that simulate actual failure conditions.

A chaos test might be as broad as shutting down a virtual machine or preventing access to a microservice. Other chaos tests take a more refined approach, introducing problems — such as latency or connection errors — to study how these factors can impact application performance or lead to outages.
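
As a rough illustration of the more refined kind of test, the sketch below wraps a function call so that it can be delayed or made to fail. Everything here is hypothetical: with_chaos, InjectedConnectionError, and fetch_profile are illustrative names, not part of any chaos testing tool, and real platforms usually inject faults at the network or infrastructure layer rather than inside application code.

```python
import random
import time


class InjectedConnectionError(ConnectionError):
    """Raised by the wrapper to simulate a failed connection."""


def with_chaos(func, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call may be delayed or fail (hypothetical helper).

    latency_s  -- fixed delay added before every call
    error_rate -- probability in [0, 1] that a call raises instead of running
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow dependency
        if rng.random() < error_rate:
            raise InjectedConnectionError("injected failure")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Stand-in for a real call to a downstream microservice.
    return {"id": user_id, "name": "example"}
```

Wrapping the call as with_chaos(fetch_profile, latency_s=0.5, error_rate=0.1) would add half a second to every request and fail roughly one in ten, which is enough to see whether callers time out, retry, or degrade gracefully.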

Origins of chaos engineering

Although developers have been finding ways to simulate failures for quite some time, the modern concept of chaos engineering began in the early 2010s at Netflix. An engineer named Greg Orzell had the clever idea of creating a tool that would terminate random instances within Netflix’s production environment — allowing the team to stress test their applications.

Orzell’s tool, now known as Chaos Monkey, eventually found its way outside of Netflix and was released to the public in 2012. While novel, Chaos Monkey was an unsophisticated tool that worked by simply terminating random instances in a production or testing environment.

The Chaos Monkey project later expanded into an entire suite of tools capable of creating many different types of chaos tests. Known as the Simian Army, this project brought tools such as Latency Monkey, Conformity Monkey, Security Monkey, and various chaos testing capabilities to complement the basic functionality of Chaos Monkey.

Benefits and challenges of chaos testing

With proper planning and implementation, chaos tests can identify errors and provide invaluable insights into applications and their environments. Here are a few major benefits:

Understanding system operations: A well-engineered chaos test can reveal many valuable insights into how applications respond to emergent situations. Before performing a chaos test, engineers first measure stable conditions, and then formulate a theory about how the system will handle a particular type of stress. By comparing the experiment’s results to the theoretical model, teams can gain insight into how their systems work and what can be improved.
Enhancing reliability: Chaos testing can often reveal flaws that have the potential to compromise reliability. Systems often respond to failures in hard-to-predict ways. For example, a chaos test might reveal how seemingly unrelated microservices affect each other during an outage. This is particularly useful for cloud-native applications with many discrete services operating to produce a single application. With software growing ever more complex and interconnected, simulating issues is often the quickest way to understand how disparate parts of the system impact one another.
Stress-testing incident response: Ensuring reliability is equal parts proactive and reactive. While much of chaos engineering focuses on gathering information to make proactive improvements in reliability, it can also assist with the reactive side: incident response. By simulating an actual incident through chaos tests, organizations have a rare opportunity to see their incident response strategy in action, allowing them to evaluate its performance and make adjustments to better prepare for real incidents.
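
The measure, hypothesize, and compare loop described under "Understanding system operations" can be sketched in a few lines. This is a minimal sketch under stated assumptions: the Hypothesis class, the evaluate function, and the idea of expressing the hypothesis as a maximum relative change in a single metric are all illustrative choices, not a standard from any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A prediction about one metric (hypothetical structure)."""
    metric: str
    max_relative_change: float  # e.g. 0.10 allows a 10% change from baseline


def evaluate(hypothesis, baseline, observed):
    """Compare a measurement taken during the test with the stable baseline.

    Returns (held, relative_change), where held is True if the metric
    stayed within the range the hypothesis predicted.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = abs(observed - baseline) / baseline
    return change <= hypothesis.max_relative_change, change
```

For instance, if the hypothesis is that p99 latency stays within 10% of a 200 ms baseline and the test observes 260 ms, the hypothesis does not hold, and the gap between prediction and result is the finding worth investigating.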

However, chaos testing is not without its challenges, including:

Failure to hypothesize and model: For a chaos test to be effective, it is vital to first understand the system’s regular operations and to predict how it will respond to the test. Without a clear hypothesis and model, the results can be hard to interpret, and insights from the chaos test may be limited. Proper planning is therefore an essential part of every chaos test.
Unintended damages: Occasionally, a test’s “controlled” chaos can create unintended consequences. When testing in a production environment, a simulated incident can become an actual incident if the “blast radius” (the full set of systems and users the test could affect in the worst case) is not properly contained. Steps should be taken to ensure that chaos tests impact only the intended systems. Testers should also be able to stop a chaos test at any time and return the system to its normal operating state without causing damage.
Insufficient observability: For chaos engineering to be effective, testers rely on robust observability tools to monitor and record the impact of the test. Without the right tools to monitor performance and collect system metrics, chaos testing can be a wasted effort that fails to generate the data needed to improve system reliability.
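
Two of these challenges, stopping a test safely and recording what happened, can be illustrated together. The sketch below is one possible shape, not a prescribed method: chaos_experiment, run_with_guard, and the inject, rollback, and read_error_rate callables are hypothetical names that a team would replace with its own fault injection and monitoring hooks.

```python
from contextlib import contextmanager


class AbortExperiment(Exception):
    """Signals that a safety limit was crossed during the test."""


@contextmanager
def chaos_experiment(inject, rollback):
    """Apply a fault and always roll it back, even if the body raises."""
    inject()
    try:
        yield
    finally:
        rollback()


def run_with_guard(inject, rollback, read_error_rate, abort_above, checks):
    """Poll a metric while the fault is active; stop early if it gets too high.

    Returns (status, samples), where samples are the recorded readings.
    A real loop would also wait between checks; that is omitted here.
    """
    samples = []
    try:
        with chaos_experiment(inject, rollback):
            for _ in range(checks):
                rate = read_error_rate()
                samples.append(rate)
                if rate > abort_above:
                    raise AbortExperiment(f"error rate {rate} exceeded {abort_above}")
    except AbortExperiment:
        return "aborted", samples
    return "completed", samples
```

Putting the rollback in a finally clause means the fault is removed whether the test finishes, aborts, or crashes, and the recorded samples give the team data to review afterward.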

Implementing chaos testing as part of DevOps

Chaos engineering experiments have quickly become a staple of reliability testing. With chaos engineering rising in popularity amongst DevOps practitioners, new tools and techniques have evolved to allow for more effective chaos testing.

Platforms like Gremlin and Chaos Mesh enable teams to design, execute, and measure chaos tests. These tools can even help to automate the process of chaos testing. You can set up tests on a regular schedule or configure them to run at random times to simulate an unpredictable failure.

Running automated chaos testing on a schedule allows teams to gain regular insights into how their system handles emergent situations over time. Consistent, scheduled testing helps teams understand the evolution of their reliability and incident response by regularly subjecting the system to simulated failures.
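
To make the difference between fixed and randomized schedules concrete, here is a small sketch that plans run times at a regular interval and optionally shifts each one by a random delay. It does not use the Gremlin or Chaos Mesh APIs; plan_runs and its parameters are hypothetical and only model the scheduling idea.

```python
import random
from datetime import datetime, timedelta


def plan_runs(start, interval, count, max_jitter=timedelta(0), rng=None):
    """Return `count` planned run times, one per interval from `start`.

    With max_jitter set, each run is delayed by a random amount between
    zero and max_jitter, so the exact time is not predictable in advance.
    """
    rng = rng or random.Random()
    runs = []
    for i in range(count):
        jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
        runs.append(start + i * interval + jitter)
    return runs
```

Calling plan_runs(datetime(2024, 1, 1, 9, 0), timedelta(days=7), 4) yields four weekly runs at the same time of day, while adding max_jitter=timedelta(hours=8) spreads each run across the working day.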

Because chaos tests are often run in production environments, teams need to design chaos testing methodologies that isolate the effects of the test. Chaos engineers should attempt to understand and contain a test’s “blast radius”. When preparing a chaos test, engineers should focus on the potential consequences of conducting the test for the directly affected part of the system and any other interconnected services.

For example, a test impacting a microservice that delivers front-end content may be able to run without many unexpected effects. In contrast, a test impacting a fundamental networking service could result in more far-reaching and unpredictable problems.
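
One way to reason about this before running a test is to walk the service dependency graph and list everything that relies, directly or indirectly, on the target. The sketch below assumes the team can describe its dependencies as a simple mapping; blast_radius and the service names in EXAMPLE_DEPENDS_ON are made up for illustration, and a real estimate would also need to account for shared infrastructure and users.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
EXAMPLE_DEPENDS_ON = {
    "web-frontend": ["content-api", "auth"],
    "content-api": ["dns"],
    "auth": ["dns"],
    "dns": [],
}


def blast_radius(depends_on, target):
    """Return every service that directly or transitively depends on target."""
    # Invert the map so we can look up who depends on a given service.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first search outward from the target along the inverted edges.
    affected = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != target and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

In this example map, nothing depends on web-frontend, so a test against it has an empty downstream set, while a test against dns reaches every other service.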

Enhancing cloud-native reliability

Whether you want to ensure the reliability of mission-critical systems, or just better understand how an application handles failures, chaos testing can provide invaluable data. In cloud-native systems where microservices are heavily dependent on one another, even a minor problem can have indirect consequences that impact distantly connected services. Fortunately, modern chaos engineering tools can help developers to identify how these systems impact one another in the event of an outage.

When it comes to improving reliability or stress-testing incident response, there is no exact model for an unplanned event that causes an outage. But a well-designed chaos test can provide a close second, especially if designed to simulate an actual incident. Chaos testing allows engineers to explore typical or highly unusual circumstances in a controlled environment, providing insights that might otherwise only be found in the aftermath of an expensive incident.