Measuring DevOps success with the latest software delivery data

To measure DevOps success, you need industry benchmarks that quantify key software delivery metrics. This year, we set the first-ever benchmarks for teams practicing CI/CD, based on 55 million data points from more than 44,000 organizations on our platform. What do high-performing engineering teams look like, quantitatively?

As the world’s largest standalone CI provider, we have a unique opportunity to investigate what software delivery looks like quantitatively: across tens of thousands of teams, commit by commit. With real delivery data, we can see how teams are building and deploying software in practice.

Your team’s ability to deliver is a competitive advantage. Which numbers should you aim for to stay ahead of the curve?

Download 2020 State of Software Delivery: Data-Backed Benchmarks for Engineering Teams to find out what the most successful teams are doing to build better and faster.

4 key benchmarks to measure your team on

Our analysis of engineering team performance data identified these four benchmarks for high-performing software teams:

  • Throughput: the number of workflow runs matters less than being at a deploy-ready state most or all of the time
  • Duration: teams should aim for workflow durations in the range of five to ten minutes
  • Mean time to recovery: teams should aim to recover from any failed runs by fixing or reverting in under an hour
  • Success rate: success rates above 90% should be your standard for the default branch of an application

What is the purpose of measuring each of these success metrics? Let’s take a deeper look at what these metrics mean and why they’re so valuable to your team.

Key metric: Duration

Duration is defined as the length of time it takes for a workflow to run. It is the most important metric in the list because creating a fast feedback cycle (including Throughput and Mean Time to Recovery) hinges on Duration. In other words, you can’t push a fix, even a much-needed one, faster than the time it takes your workflow to run. Duration also represents the speed with which your developers can get a meaningful signal (“did my workflow run pass or fail?”). A short duration requires an optimized workflow.

Not all workflows produce the same end-state. For instance, some workflows only run specific tests depending on the part of the application codebase that changed. Duration, therefore, is not an explicit measure of how long it takes to deploy to production. It is just a measure of how long it takes to get to a workflow’s conclusion.

The ultimate goal of CI is fast feedback. A failed build signal needs to get to developers as soon as possible; you can’t fix what you’re not aware of. But awareness is not the only consideration. Developers also need information from their failed builds. Getting the right information comes from writing rigorous tests for your software.

It is important to emphasize here that speed alone is not the goal. A workflow without tests can run quickly and return green, a signal that is not helpful to anyone. Teams need to be able to act on a failure as quickly as possible and with as much information as they can get from the failure. Without a quality testing suite, workflows with short durations aren’t contributing valuable information to the feedback cycle. The goal, then, is rich information combined with short Duration.
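Duration is straightforward to track from your workflow history. As a rough sketch (the list-of-durations input here is hypothetical, not a platform API), a team could summarize its durations and compare them against the five-to-ten-minute range:

```python
from statistics import mean, quantiles

def duration_stats(durations_sec):
    """Summarize workflow durations (in seconds) as mean and
    95th percentile, reported in minutes."""
    p95 = quantiles(durations_sec, n=20)[-1]  # 95th percentile cut point
    return {"mean_min": mean(durations_sec) / 60, "p95_min": p95 / 60}

# Hypothetical sample: six workflow runs lasting 5 to 10 minutes
stats = duration_stats([300, 360, 420, 480, 540, 600])
```

Tracking a high percentile alongside the mean matters because a handful of slow outlier runs can hide behind a healthy-looking average.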

Key metric: Mean Time to Recovery

Mean Time to Recovery is defined as the average time between a failed run and the next successful run. This is the second most important metric in the list: once a failure signal reaches your team, their ability to address the issue quickly is invaluable. Because Mean Time to Recovery improves with more comprehensive test coverage, this metric can serve as a proxy for how well-tested your application is.

Failed build, valuable signal, rapid fix, passing build: continuous integration makes these rapid feedback loops possible. The fast signals enable teams to try new things and respond to any impact immediately. Likewise, solid test coverage reduces the fear of introducing broken code into your production codebase, allowing you to challenge your engineering teams to be creative and nimble with the solutions they develop.

Key metric: Throughput

Throughput is defined as the average number of workflow runs per day. A workflow is triggered when a developer makes an update to the codebase in a shared repository. A push to your version control system (VCS) triggers a CI pipeline that contains your workflow.

The number of workflow runs indicates how many discrete units of work move through your application development pipeline. One component of throughput reflects the size of your commits: are you pushing many small changes or fewer large changes? The right size will depend on your team, but the goal is to have units of work small enough that you can debug quickly and easily but large enough that you’re deploying a meaningful change.

We recommend monitoring Throughput rates rather than setting explicit goals. It is important to see how often things are happening, and Throughput is a direct measurement of commit frequency. Fluctuations in Throughput can occur in situations like onboarding, where two developers may work through the same tasks together and push fewer commits as a result. Establishing baseline metrics for your organization prepares you for this type of impact, allowing you to forecast engineering productivity through predictable events. When you encounter unforeseen circumstances, your baseline can also help you determine the volume of work that went undone.

When a well-tested application is in a state where it can be deployed at any time, it’s because every new change has been continuously validated. Without a fully automated software delivery pipeline, a team is subject to deploy emergencies and fire drills, often at inopportune times (Friday nights, for example).

With a fully automated software delivery pipeline, it is up to you how frequently (and when) updates are delivered to your end-users: hot-fixes immediately; feature upgrades as they are developed; large-scale changes on a calendar set by your business demands. A particular number of deploys/day is not the goal, but continuous validation of your codebase via your pipeline is.

Key metric: Success Rate

Success Rate is defined as the number of passing runs divided by the total number of runs over a period of time. Git-flow models that rely on topic branch development (vs. default branch development) enable teams to keep their default branches green.

One important thing to note is that we expect to see high variability of success depending on whether the workflow is run on the default or topic branch. In many git-flow models, topic branches are where the majority of work is done, and therefore where the majority of signal-generating passing and failing experiments occur.

By scoping feature development to topic branches, we can differentiate between intentional experiments (where failing builds are valuable and expected) and stability issues (where failing builds are undesirable). Success rate on the default branch is a more meaningful metric than success rate on a topic branch.
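Scoped this way, Success Rate is a simple ratio computed per branch. The `(branch, passed)` record format here is hypothetical:

```python
def success_rate(runs, branch="main"):
    """Fraction of passing runs on the given branch.
    runs: (branch_name, passed) tuples; returns None if no runs match."""
    outcomes = [passed for b, passed in runs if b == branch]
    return sum(outcomes) / len(outcomes) if outcomes else None
```

Computing the rate per branch is what lets the default-branch target (above 90%) coexist with noisier, experiment-heavy topic branches.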

Set DevOps success benchmarks for your team

While there is no universal standard that every team should aspire to, our data and the software delivery patterns we’ve observed on our platform show that there are reasonable benchmarks for teams to set as goals. Ultimately, your ability to measure your baseline and make incremental improvements on these metrics is more valuable than chasing “ideal” numbers.

Learn more about how to measure DevOps success with our 2020 data report

To find out how you and your team can amplify your software delivery going forward, download 2020 State of Software Delivery: Data-Backed Benchmarks for Engineering Teams here.