Testing helps you discover bugs before you release software, enabling you to deliver a high-quality product to your customers. Sometimes, though, tests are flaky and unreliable.
What is a flaky test?
Flaky tests, also known as flappers, fail to produce accurate and consistent results. These tests may be unreliable because of newly written code or as a result of external factors.
If your tests are flaky, they cannot help you find (and fix) all your bugs, which negatively impacts user experience. In this article, I will help you discover whether your tests are flaky and show you how to fix them.
How to discover flaky tests
Flaky tests result mostly from insufficient test data, narrow test environment scope, and complex technology. Some other factors that play a role in making tests unreliable are:
- Asynchronous waits
- Time of day
- Test order dependency
The next sections of this article describe how to recognize when each of these factors is contributing to test flakiness, along with ways to prevent each type.
Asynchronous waits
For example, consider an integration test that fetches data from an external API. If application code makes an external API call asynchronously but does not explicitly wait until the data is ready, it may cause test flakiness. Sometimes, the data will be ready when it is needed by the test. Sometimes, it will not. Success or failure may vary depending on the speed of the machine the code is running on, the quality of its network connection, and many other factors.
Making tests wait for asynchronous work to complete can reduce flakiness. However, fixed waits also add long delays to automated test runs. It is important to note that the code causing test flakiness through improper asynchronous calls will usually be in the application code under test, not in the test itself.
On a related note, the problem with asynchronous waits is that they could last forever if something goes wrong. Setting timeouts can solve this problem, but timeouts introduce a new problem. For example, if there is a long delay in loading data from an API call, tests fail because the wait time exceeds the timeout limit. This also results in a flaky test.
Timeouts can occur while loading data for several reasons:
- Calls to external APIs that are under heavy load
- Data retrieval from hard disks on slow machines
- File uploads from a computer to a server
Loading data from an API, reading it from a relatively slow disk such as EBS, or waiting for a file upload can take a variable amount of time, so sometimes timeouts occur because you have to set some kind of time limit. You cannot wait forever for something to finish. Sometimes timeouts occur, and other times tests run successfully without timing out.
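To illustrate, here is a minimal Python sketch of replacing a fixed sleep with polling plus an explicit timeout. The `wait_for` helper and `fake_async_fetch` function are hypothetical stand-ins for real application code, not part of any particular framework:

```python
import threading
import time

def wait_for(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Hypothetical stand-in for an asynchronous API call that fills in a
# result after a short, variable delay.
result = {}

def fake_async_fetch():
    result["data"] = {"status": "ok"}

threading.Timer(0.2, fake_async_fetch).start()

# A fixed sleep (e.g. time.sleep(1)) guesses at the delay and is flaky.
# Polling with an explicit timeout waits only as long as needed, and
# fails cleanly if the data never arrives.
assert wait_for(lambda: "data" in result, timeout=2.0)
assert result["data"]["status"] == "ok"
```

The timeout still exists, but it is now a safety limit rather than an estimate of how long the work takes, so fast runs stay fast and slow runs get the full window.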
To work around this problem, it is often a good idea to mock out calls to external systems to make sure that your tests are testing your code, and not the reliability of a third-party API. Use your best judgment, though. It is useful to know if, for example, a service your app integrates with is frequently slow or unavailable. In that case, the “flaky” test is providing you valuable information, and you may not want to make any changes.
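As a sketch of this idea, the hypothetical `fetch_user` function below accepts an injected HTTP getter, so a test can substitute a mock from Python's standard `unittest.mock` library and never touch the network:

```python
from unittest import mock

# Hypothetical application code under test: the HTTP getter is injected
# so tests can substitute a deterministic fake for the real network call.
def fetch_user(api_get, user_id):
    response = api_get(f"/users/{user_id}")
    return response["name"]

def test_fetch_user_with_mocked_api():
    # Deterministic, fast, and independent of third-party reliability.
    fake_get = mock.Mock(return_value={"name": "Ada"})
    assert fetch_user(fake_get, 42) == "Ada"
    fake_get.assert_called_once_with("/users/42")

test_fetch_user_with_mocked_api()
```

The test now verifies your code's behavior given a known response, while a separate health check or contract test can cover the real API if its reliability matters to you.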
Time of day
Sometimes, code behavior changes throughout the day. This means test success may depend on the time the test runs. For example, say we are writing automation tests for an appointment system that includes booking time slots for specific intervals. If we run this test on our production pipeline, it can produce different results at different times. Time slot availability is dependent on the time of the day, which adds flakiness to the test.
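One common fix, sketched below with a hypothetical `slot_is_available` function, is to make the clock injectable so tests can pin it to a fixed moment instead of depending on when the suite runs:

```python
from datetime import datetime, time as dtime

def slot_is_available(slot_start, slot_end, now=None):
    """A booking slot is open only during its window; `now` is injectable."""
    now = now or datetime.now()  # real clock in production, fixed in tests
    return slot_start <= now.time() < slot_end

# Flaky: the outcome depends on when the test suite happens to run.
# assert slot_is_available(dtime(9, 0), dtime(17, 0))

# Deterministic: pin the clock to a known moment.
business_hours = (dtime(9, 0), dtime(17, 0))
assert slot_is_available(*business_hours, now=datetime(2024, 1, 15, 10, 30))
assert not slot_is_available(*business_hours, now=datetime(2024, 1, 15, 22, 0))
```

The same pattern works for any time-sensitive logic: production code reads the real clock by default, and tests pass in whatever moment they need to exercise.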
Concurrency
There may be data races, deadlocks, or other concurrency issues in the application code or in the tests themselves. When developers make incorrect assumptions about the order of operations performed by different threads, the result can be flaky tests.
Non-determinism in test execution is not necessarily an issue because there are several cases where multiple code behaviors are correct. When the test checks for only a subset of all possible valid behaviors, flakiness may be the result.
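As a minimal illustration, the sketch below guards a shared counter with a lock; without the lock, the unsynchronized `+= 1` can lose updates and the final total can vary between runs. The counter structure and thread counts are illustrative:

```python
import threading

def increment_many(counter, n, lock):
    for _ in range(n):
        with lock:  # remove this lock and the final total can vary by run
            counter["value"] += 1

counter = {"value": 0}
lock = threading.Lock()
threads = [
    threading.Thread(target=increment_many, args=(counter, 10_000, lock))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock, the result is deterministic: 4 threads x 10,000 increments.
assert counter["value"] == 40_000
```

A test asserting the exact total is flaky against the unlocked version precisely because it checks one outcome out of many that the racy code can produce.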
Test order dependency
A test might fail because of the test that runs before or after it. This happens when tests share mutable state, such as global variables, test inputs, and dependencies.
We need to remove, or at least minimize, the dependencies among these tests to improve accuracy and reduce flakiness. Wherever your test depends on another module, use stubs and mocks. Stubs are objects that return predefined responses to requests. Mocks are objects that mimic a working implementation, without reproducing it at full production fidelity. Mocking and stubbing create tests that run in isolation.
How to prevent flaky tests
Here is a quick overview of strategies you can use:
- Replace fixed waits with explicit waits and sensible timeouts when loading data
- Mock out calls to external systems so tests do not depend on third-party reliability
- Remove time-of-day dependence by injecting a fixed clock into tests
- Guard shared state in concurrent code and avoid assumptions about operation order
- Minimize test order dependency with stubs, mocks, and isolated test state
Automating flaky test detection
Reviewing all your tests for flakiness, not to mention checking each specific flakiness factor, is a time-consuming process. The easiest way to detect flaky tests is to automate the system that runs your tests and collects and displays the results. Continuous integration and continuous delivery (CI/CD) tools like CircleCI can help.
What caused the failure: the application or the test?
There are only two ways for a test to fail: either the test is flaky, or the application is not working as expected. It is essential to understand which one caused the failure. If it is a flaky test, you can optimize the test to reduce flakiness. If it is an application failure, you should report this valid failure to the relevant stakeholders.
The success rate on your main branch can serve as a single source of truth, because there should be no test failures there: the success rate should be 100 percent. If it is less than 100 percent, either the tests are failing due to application issues, or the tests are flaky.
Take advantage of data from improved insight
Once you have an automated test runner, you can start collecting data. CircleCI, for example, provides a Test Insights dashboard with flaky test detection that can help you analyze and diagnose intermittent test failures. Test Insights provides a detailed overview of your 100 most recent test executions and automatically flags tests that fail non-deterministically, as well as those that run long or fail most often.
CircleCI also provides webhooks for integrations with third-party monitoring and observability tools like Datadog and Sumo Logic. These integrations offer benefits like real-time job performance monitoring and advanced CI analytics through a user-friendly interface. The integrated dashboards surface early indicators so developers can efficiently resolve issues affecting their applications.
Discover and communicate failure patterns faster
Dashboard panels include a job analytics overview to track and showcase real-time data like job status within a project, job outcome visualizations for the CircleCI pipeline, and a visualization of the top ten slowest successful jobs in the pipeline.
Tools like CircleCI detect flaky tests because automation lets you run tests much more frequently, making it easier to notice failure patterns you would miss otherwise. Automation also helps develop better techniques to detect flaky tests, reduce non-determinism, fix the test, label test failures as flaky or not, and prevent future flaky tests.
Now that you know more about the dangers of flaky tests, how to detect them, and how to fix them, consider an automated tool to help you test better. I hope I have convinced you to use your CI/CD pipeline to automate the discovery of flaky tests. You can get started right away by signing up for a CircleCI free trial today.