Testing is vital because it helps you discover bugs before you release software, enabling you to deliver a high-quality product to your customers. Sometimes, though, tests are flaky and unreliable.

Tests may be unreliable because of newly written code or external factors. These flaky tests, also known as flappers, pass or fail inconsistently even when nothing in the code under test has changed.

If your tests are flaky, they cannot help you find (and fix) all your bugs, which negatively impacts user experience. In this article, I will help you discover whether your tests are flaky and show you how to fix them.

How to discover flaky tests

Flaky tests result mostly from insufficient test data, a narrow test environment scope, and complex technology stacks. Some other factors that play a role in making tests unreliable are:

  • Asynchronous waits
  • Timeouts
  • Time of day
  • Concurrency
  • Test order dependency

The next sections of this article describe how to recognize when these factors are contributing to testing flakiness.

Asynchronous waits

Developers often write tests dependent on specific data. If the tests run before the data loads, they become unreliable. This is particularly common in languages that use asynchronous APIs extensively, like JavaScript, C#, and Go.

For example, consider an integration test that fetches data from an external API. If application code makes an external API call asynchronously but does not explicitly wait until the data is ready, it may cause test flakiness. Sometimes, the data will be ready when it is needed by the test. Sometimes, it will not. Success or failure may vary depending on the speed of the machine the code is running on, the quality of its network connection, and many other factors.

Making tests wait explicitly for data to be ready reduces flakiness. However, waiting can also add long delays to automated test runs. It is important to note that the code causing test flakiness due to improper asynchronous calls will usually be in the application code under test, not in the test itself.
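As a sketch of this pattern (the `DataStore` class here is hypothetical, simulating application code that loads data asynchronously), polling for the data with a bounded wait is far more reliable than asserting on it immediately:

```python
import threading
import time

class DataStore:
    """Simulates application code that loads data asynchronously."""
    def __init__(self):
        self.data = None

    def load_async(self):
        def worker():
            time.sleep(0.05)  # stands in for variable network/disk latency
            self.data = {"status": "ready"}
        threading.Thread(target=worker).start()

def wait_for(predicate, timeout=2.0, interval=0.01):
    """Poll until predicate() is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

store = DataStore()
store.load_async()

# Flaky version: asserting immediately assumes the data has already loaded.
# assert store.data is not None  # may pass or fail depending on timing

# Reliable version: explicitly wait for the data before asserting on it.
assert wait_for(lambda: store.data is not None)
assert store.data["status"] == "ready"
```

Polling with a timeout, rather than sleeping for a fixed duration, also keeps the test fast when the data loads quickly.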

Timeouts

On a related note, the problem with asynchronous waits is that they could last forever if something goes wrong. Setting timeouts can solve this problem, but timeouts introduce a new problem. For example, if there is a long delay in loading data from an API call, tests fail because the wait time exceeds the timeout limit. This also results in a flaky test.

Timeouts can occur while loading data for several reasons:

  • Calls to external APIs that are under heavy load
  • Data retrieval from hard disks on slow machines
  • File uploads from a computer to a server

Loading data from an API, reading it from a relatively slow disk such as Amazon EBS, or waiting for a file upload can all take a variable amount of time. Because you must set some time limit (you cannot wait forever for an operation to finish), the same test will sometimes time out and sometimes pass.

To work around this problem, it is often a good idea to mock out calls to external systems to make sure that your tests are testing your code, and not the reliability of a third-party API. Use your best judgment, though. It is useful to know if, for example, a service your app integrates with is frequently slow or unavailable. In that case, the “flaky” test is providing you valuable information, and you may not want to make any changes.
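One way to sketch this mocking approach in Python (the `fetch_user` function and the `api.example.com` endpoint are hypothetical) is to make the network transport injectable, so a test can substitute a canned response for the real external call:

```python
import io
import json
import urllib.request

def fetch_user(user_id, opener=urllib.request.urlopen):
    """Application code: fetches a user from a (hypothetical) external API.
    The opener parameter lets tests substitute a fake transport."""
    with opener(f"https://api.example.com/users/{user_id}") as resp:
        return json.load(resp)

def display_name(user):
    """The logic we actually want to test."""
    return f'{user["first"]} {user["last"]}'.strip()

def fake_opener(url):
    """Stub transport: returns a canned response instead of hitting the network."""
    body = json.dumps({"first": "Ada", "last": "Lovelace"}).encode()
    return io.BytesIO(body)

# The test now exercises our parsing and formatting logic, not the
# availability or speed of the third-party API.
user = fetch_user(42, opener=fake_opener)
assert display_name(user) == "Ada Lovelace"
```

With the external dependency stubbed out, the test cannot time out because of a slow API, and a failure points directly at your own code.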

Time of day

Sometimes, code behavior changes throughout the day. This means test success may depend on the time the test runs. For example, say we are writing automation tests for an appointment system that includes booking time slots for specific intervals. If we run this test on our production pipeline, it can produce different results at different times. Time slot availability depends on the time of day, which adds flakiness to the test.
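A common fix is to let the test control the clock. In this sketch (the `booking_open` function and its 9:00-17:00 booking window are hypothetical), the code under test accepts an explicit time instead of always reading the system clock:

```python
from datetime import datetime, time

def booking_open(now=None):
    """Hypothetical appointment logic: slots can be booked 9:00-17:00."""
    now = now or datetime.now()
    return time(9, 0) <= now.time() < time(17, 0)

# Flaky: the result depends on when the test happens to run.
# assert booking_open()

# Deterministic: pass a fixed clock value into the code under test.
assert booking_open(datetime(2023, 6, 1, 10, 30)) is True
assert booking_open(datetime(2023, 6, 1, 22, 0)) is False
```

Injecting the clock this way lets one test suite cover morning, evening, and boundary cases regardless of when the pipeline runs.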

Concurrency

There may be data races, deadlocks, or concurrency issues in application code or in the tests themselves. When developers make incorrect assumptions about the order of operations that different threads perform, this can result in flaky tests.

Non-determinism in test execution is not necessarily an issue because there are several cases where multiple code behaviors are correct. When the test checks for only a subset of all possible valid behaviors, flakiness may be the result.
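The classic incorrect assumption is that a read-modify-write sequence is atomic. This hypothetical counter sketch shows the pattern; note that on CPython the global interpreter lock can mask the race much of the time, which is exactly why the resulting test failures are intermittent:

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        v = self.value   # read
        v += 1           # modify
        self.value = v   # write: another thread may interleave here

    def safe_increment(self):
        with self._lock:  # the lock makes the read-modify-write atomic
            self.value += 1

def run(increment, n_threads=8, n_iter=10_000):
    counter = Counter()
    threads = [
        threading.Thread(target=lambda: [increment(counter) for _ in range(n_iter)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

# With the lock, the total is always deterministic.
assert run(Counter.safe_increment) == 8 * 10_000

# Without it, increments are lost only when threads interleave at the
# wrong moment, so a test asserting on the total passes most runs and
# fails occasionally: a flaky test.
unsafe_total = run(Counter.unsafe_increment)
```

A test built on the unsafe version would be labeled flaky, but the real defect is the data race in the application code.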

Test order dependency

A test might fail because of the test that runs before or after it. This happens because many tests share data, such as state variables, inputs, and dependencies, and one test can leave that shared state in a condition the next test does not expect.

We need to remove, or at least minimize, the dependencies among these tests to improve accuracy and reduce flakiness. Wherever your test depends on another module, use stubs and mocks. Stubs are objects that return predefined responses to requests. Mocks (also called fakes) are objects that mimic the behavior of a real dependency closely enough for testing, without reproducing it at full production fidelity. Mocking and stubbing create tests that run in isolation.

Automating flaky test detection

Reviewing all your tests for flakiness, not to mention each specific flakiness factor, is a time-consuming process. The easiest way to automate flaky test detection is to automate running your tests and collecting and displaying the resulting data. Continuous integration and continuous delivery (CI/CD) tools like CircleCI can help.

There are only two ways for a test to fail: either the test is flaky, or the application is not working as expected. It is essential to understand which of the two caused a given failure. If it is a flaky test, you can optimize the test to reduce flakiness. If it is an application failure, you should report this valid failure to the relevant stakeholders.

The test success rate on the master branch can serve as a single source of truth, because there should be no test failures on master: its success rate should be 100 percent. If the success rate is less than 100 percent, either tests are failing due to application issues, or the tests are flaky.
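A simple heuristic follows from this: a test that both passes and fails on the same commit is flaky, because the code did not change between those runs. A minimal sketch of that classification (the run records below are invented for illustration):

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Label a test flaky if it both passed and failed on the same commit.
    `runs` is a list of (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items()
                   if seen == {True, False}})

runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, mixed results: flaky
    ("test_login",    "abc123", False),
    ("test_login",    "abc123", False),  # consistently failing: a genuine bug
]
assert find_flaky_tests(runs) == ["test_checkout"]
```

A CI system that stores per-commit test results can apply this kind of rule automatically and route genuine failures to stakeholders while flagging flaky tests for cleanup.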

After you automate running tests and collecting data, CircleCI, for example, enables you to create dashboards that help you analyze intermittent test failures and how often they occur. These dashboards provide a detailed workflow overview and project insights, with detailed analysis of build status and build performance.

CircleCI also provides dashboards for Sumo Logic integration, offering benefits like job performance optimization and advanced CI analytics with a user-friendly interface. These integrated dashboards surface early indicators so developers can efficiently resolve issues affecting their applications.

Dashboard panels include a job analytics overview to track and showcase real-time data like job status within a project, job outcome visualizations for the CircleCI pipeline, and a visualization of the top ten slowest successful jobs in the pipeline.

Tools like CircleCI detect flaky tests because automation lets you run tests much more frequently, making it easier to notice failure patterns you would otherwise miss. Automation also helps you reduce non-determinism, fix flaky tests, label test failures as flaky or genuine, and prevent new flaky tests.

Wrap up

Now that you know more about the danger of flaky tests, how to detect them, and how to fix them, consider an automated tool to help you test better. I hope I have convinced you to try using your CI/CD pipeline to automate the discovery of flaky tests. You can get started right away by signing up for your CircleCI free trial today.