Feedback loops: the key to improving mean time to recovery

Red builds are a fact of life for DevOps teams: if you make well-tested software with automated tooling, some of your builds will inevitably fail. But slow recovery times from failed builds shouldn’t be taken for granted, since they damage a team’s ability to speed up development cycles. To shorten the mean time to recovery (MTTR), developers need data to troubleshoot problems. Continuous integration and continuous delivery (CI/CD) can help shrink the time it takes to restore service by providing data about your build quickly, creating the fast feedback loops DevOps teams need.

We’re quick to say that speed is never the sole goal of robust DevOps practices. As we explain in our recent report, The Data-Driven Case for CI: What 30 Million Workflows Reveal About DevOps in Practice, metrics like MTTR need to be understood in concert with other measures, like lead times and deployment frequency, and speed needs to be matched with quality. CI/CD helps accelerate development while also adding intelligence and insights to the DevOps process. Consistency of delivery is the unsung hero of automation: speed without reliable, consistent quality is not helpful.

In this post, the third in a series, we’ll examine how data from CircleCI workflows supports industry-standard metrics like MTTR. These metrics show that it’s possible to optimize for stability without sacrificing speed, and CircleCI’s data shows how teams perform against them in real life.

Getting builds from red to green, in less time

Improving mean time to recovery requires you to get actionable information, fast. If your team can get failed statuses returned quickly, they can get builds from red to green in the shortest possible time. Adopting CI/CD practices enables rapid feedback loops, and is the best way to ensure a fast signal for your developers. With a robust CI/CD practice, developers have access to real-time artifacts such as logs and coverage reports from your test suite, which give them the opportunity to troubleshoot in an environment equivalent to the production environment.

To understand how observed development behavior compares with industry standards, we looked at CircleCI data from over 30 million workflows, observed between June 1 and August 30, 2019. The workflows represent:

1.6 million jobs run per day
More than 40,000 orgs
Over 150,000 projects

Here’s what we found:

The minimum time to recovery was recorded at less than one second.
The maximum time was 30 days.
The median time to recovery was 17.5 hours.

The scenario of a minimum recovery time of less than one second can only happen when two workflows are started nearly simultaneously, and one fails while the other passes. In any other scenario, it isn’t possible to get a signal and respond to that signal in that amount of time.

The maximum recorded recovery time of 30 days is a function of CircleCI’s dataset cutting off recovery at the 30-day mark; it is not meant to indicate that recovery is occurring in 30 days.

The interesting data here is not the minimum time or the maximum time, but the median time to recovery, which is 17.5 hours – about the length of time between the end of one workday and the start of another. This implies that when engineers get a failing signal at the end of the day, they wait until the following day to resolve it, which may provide insight into how DevOps teams operate.
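To make the recovery figures above concrete, here is a minimal sketch of how time to recovery could be derived from a branch’s workflow history. It is not CircleCI’s actual method: the record shape (a timestamp plus a "failed" or "success" status), the function name recovery_times, and the choice to measure from the first failure of a red streak to the next success are all assumptions made for illustration. The 30-day cap mirrors the cutoff described above.

```python
from datetime import datetime, timedelta
from statistics import median

# Assumption: mirrors the 30-day cutoff described in the text.
RECOVERY_CAP = timedelta(days=30)


def recovery_times(runs):
    """Return one duration per red streak on a single branch.

    `runs` is a list of (finished_at, status) tuples with status
    "failed" or "success". Recovery is measured from the first failure
    of a streak to the next success, capped at RECOVERY_CAP. A streak
    with no later success is left out.
    """
    durations = []
    first_failure = None
    for finished_at, status in sorted(runs):
        if status == "failed":
            if first_failure is None:
                first_failure = finished_at
        elif status == "success" and first_failure is not None:
            durations.append(min(finished_at - first_failure, RECOVERY_CAP))
            first_failure = None
    return durations


# Hypothetical history for one branch.
runs = [
    (datetime(2019, 6, 3, 17, 30), "failed"),   # red at the end of the day
    (datetime(2019, 6, 3, 17, 45), "failed"),   # still red
    (datetime(2019, 6, 4, 11, 0), "success"),   # green the next morning
    (datetime(2019, 6, 10, 9, 0), "failed"),
    (datetime(2019, 6, 10, 9, 20), "success"),  # fixed within minutes
    (datetime(2019, 6, 12, 9, 0), "failed"),
    (datetime(2019, 7, 20, 9, 0), "success"),   # longer than 30 days, so capped
]
times = recovery_times(runs)
print(median(times))
```

In this made-up history, the failure at the end of one day that is fixed the next morning is the middle value, so it sets the median, while the capped streak shows why a 30-day maximum says more about the cutoff than about real recovery behavior.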

What’s the takeaway from our data?

The 17.5-hour median value goes against the conventional wisdom that is reflected in polls and surveys indicating that MTTR is much more rapid. What’s driving the disparity between reported and observed behavior? Our guess is that survey design itself is influencing a speedy reported MTTR.

Survey questions about engineering metrics are often worded in this way: “For the primary application or service that you work on, tell me about _____?” However, CircleCI’s data covers all workflows for all branches, not just the default branch and not just the primary application. While the median observed time to recovery is 17.5 hours, it’s likely that the mean time to recovery on the default branch of your primary application is faster – more like minutes than hours.

According to our workflow data, over 30 percent of active project branches never failed (that is, were 100 percent green) over the 90-day observed time period – meaning that MTTR is 0 seconds. Since they never failed, they never needed to recover.
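As an illustration of the never-failed figure, the sketch below computes the share of branches whose runs were all green. The input shape (a mapping from branch name to a list of statuses) and the function name never_failed_share are hypothetical, chosen only for this example.

```python
def never_failed_share(statuses_by_branch):
    """Fraction of branches whose runs were all green.

    `statuses_by_branch` maps a branch name to the list of statuses
    ("success" or "failed") observed on it during the period.
    """
    if not statuses_by_branch:
        return 0.0
    green = sum(
        1 for statuses in statuses_by_branch.values()
        if "failed" not in statuses
    )
    return green / len(statuses_by_branch)


# Hypothetical sample: two of the four branches never went red.
sample = {
    "main": ["success", "success"],
    "feature/login": ["failed", "success"],
    "fix/typo": ["success"],
    "spike/cache": ["failed", "failed", "success"],
}
share = never_failed_share(sample)
print(f"{share:.0%} of branches never failed")
```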

Other interesting data points from our workflows show that 50 percent of all recovery happens in under one hour, and that 25 percent of organizations recover in under 15 minutes. We also noted that 50 percent of organizations recover in one try, and 75 percent recover within two tries.
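The two kinds of figures in the paragraph above, the share of recoveries under a time threshold and the number of tries needed, could be computed along these lines. This is a sketch under stated assumptions: the helper names share_under and tries_to_recover are hypothetical, and a "try" is assumed to mean one run started after the first failure, counted up to and including the first green run.

```python
from datetime import timedelta


def share_under(durations, threshold):
    """Fraction of recovery durations strictly below `threshold`."""
    if not durations:
        return 0.0
    return sum(d < threshold for d in durations) / len(durations)


def tries_to_recover(statuses):
    """Count runs after the first failure, up to and including the
    first success. Returns None if the branch never failed or never
    recovered."""
    tries = 0
    seen_failure = False
    for status in statuses:
        if not seen_failure:
            seen_failure = status == "failed"
            continue
        tries += 1
        if status == "success":
            return tries
    return None


# Hypothetical recovery durations.
durations = [
    timedelta(minutes=10),
    timedelta(minutes=45),
    timedelta(hours=3),
    timedelta(hours=18),
]
under_one_hour = share_under(durations, timedelta(hours=1))
one_try = tries_to_recover(["success", "failed", "success"])
two_tries = tries_to_recover(["failed", "failed", "success"])
```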

The top 10 percent of performers spent less than 10 minutes doing the work necessary to fix their builds and get back to green. This supports the time-to-restore metrics for elite performers found in the State of DevOps 2019 report. Elite teams, the report noted, show a mean time to recovery of less than one hour, compared to low performers with recovery times between one week and one month.

It’s also interesting that there’s a large frequency gap in the data between three hours and the median time of 17.5 hours, which suggests that if a red build is not recovered in under three hours, it will likely not be fixed until the following day.

Every little bit of CI helps

Improving MTTR is one step towards becoming a high-performance DevOps organization. And, as we firmly believe at CircleCI, even a little bit helps in terms of improving performance. Even if you’re just starting out with CI/CD – and you’re not doing it perfectly – you’re still on the right track.

In our next and final post in this four-part series, we’ll look at change fail rate. And if you missed the first two posts, read about lead times and deployment frequency.