As we’ve been discussing in our blog post series about CircleCI’s new report, The Data-Driven Case for CI: What 30 Million Workflows Reveal About DevOps in Practice, speed isn’t everything in DevOps, at least not on its own. Just as speed isn’t a good metric in a vacuum, more green builds don’t necessarily indicate high performance. In this post, the fourth and last in our series exploring metrics for DevOps high performers, we’ll look at change failure rate (also see our lead time, deployment frequency, and mean time to recovery posts in the series). Much like with measurements around speed, there’s more to the change failure rate metric than meets the eye.
As explained in the State of DevOps 2019 report, change fail percentage, or change failure rate, is defined this way:
“For the primary application or service you work on, what percentage of changes to production or released to users result in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)?”
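To make the definition concrete, here is a minimal sketch of the arithmetic in Python. The `Change` record and its field names are hypothetical and chosen only for illustration; they do not come from the State of DevOps report or from any CircleCI API. A change counts as failed only when it both degraded service and required remediation, as the definition states.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Change:
    """One change released to production (hypothetical record shape)."""
    change_id: str
    degraded_service: bool      # led to service impairment or an outage
    required_remediation: bool  # needed a hotfix, rollback, fix forward, or patch


def change_failure_rate(changes: Sequence[Change]) -> float:
    """Percentage of changes that degraded service and then required remediation."""
    if not changes:
        return 0.0  # assumption: report 0% when there is nothing to measure
    failed = sum(
        1 for c in changes if c.degraded_service and c.required_remediation
    )
    return 100.0 * failed / len(changes)


# Illustrative data only: four changes, one of which meets both conditions.
changes = [
    Change("a1", degraded_service=False, required_remediation=False),
    Change("b2", degraded_service=True, required_remediation=True),
    Change("c3", degraded_service=False, required_remediation=False),
    Change("d4", degraded_service=True, required_remediation=False),
]
print(change_failure_rate(changes))  # 25.0
```

Note that change `d4` degraded service but did not need remediation, so under this reading of the definition it does not count toward the rate.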
The change fail percentage allows DevOps teams to gauge their progress on the high-performance journey – one that’s aided by adoption of continuous integration and continuous delivery (CI/CD). As we examined workflow data to analyze how standard industry metrics actually match up in practice, we found ample evidence that CI/CD provides a clear path to becoming a high-performing team:
- Teams using CI are incredibly fast: 80% of all workflows finish in less than 10 minutes.
- Teams using CI stay in flow and keep work moving: 50% of all recovery happens in under an hour.
- 25% of orgs recover in 10 minutes.
- 50% of orgs recover in 1 try.
If a team produces code without errors, it doesn’t always follow that they are a high-performing team. In fact, red builds are an everyday part of the development process for teams of every skill level. The key is being able to act on failures as soon as possible, and to glean information from failures to improve future workflows. (There are reasons for red builds, and you need to uncover those reasons.)
Why topic branches are the best place to improve signal
According to metrics like those explored in the State of DevOps 2019 report, the highest performing teams rarely push bad code to their default branch. But don’t take that to mean that these teams never write faulty code. These teams perform testing and security checks on a separate branch; only when everything passes is a merge to the default branch allowed to take place.
This is a good DevOps practice, one that originated with Vincent Driessen’s Git-flow model. Teams should know that their code works well before it’s merged into the default branch. Topic branches are the best place to get the fastest signal, and they are where it’s safe to fail. Change failure rate on topic branches is higher since this is where the majority of the work is being done, and because failure on these branches won’t bring down the default branch. A failure there affects only the people working on that branch rather than the entire codebase or product.
We also observed other teams using trunk-based development, another common development strategy where team members commit directly to the default branch. This strategy is optimized for recovery time: when a build is red, everyone works to recover. However, we did not see this method used nearly as often in the data.
What the data tells us
To understand how observed development behavior compares with industry standards, we looked at CircleCI data from over 30 million workflows observed between June 1 and August 30, 2019. The workflows represent:
- 1.6 million jobs run per day
- More than 40,000 orgs
- Over 150,000 projects
Here’s what we found:
- Overall, 27% of all workflows on CircleCI fail.
- Topic branches have an average failure rate of 31%.
- If we look only at default branches, the change failure rate is down to 18%.
- 50% of projects never had a failure when there were configuration changes to the circle.yml file orchestrating the CI/CD process.
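A breakdown like the one above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis behind the report: the `(branch, status)` tuples, the `"failed"` status string, and the set of default branch names are all hypothetical, and the sample data is invented.

```python
from collections import defaultdict

# Assumption: a branch is treated as "default" if its name is in this set.
DEFAULT_BRANCHES = {"main", "master"}


def failure_rate_by_branch_type(workflows):
    """Return the failure percentage for default and topic branches.

    `workflows` is an iterable of (branch_name, status) pairs, where a
    status of "failed" marks a red workflow (hypothetical record shape).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for branch, status in workflows:
        kind = "default" if branch in DEFAULT_BRANCHES else "topic"
        totals[kind] += 1
        if status == "failed":
            failures[kind] += 1
    return {kind: 100.0 * failures[kind] / totals[kind] for kind in totals}


# Invented sample: 1 of 4 default-branch runs fails, 2 of 4 topic-branch runs fail.
workflows = [
    ("main", "success"),
    ("main", "success"),
    ("main", "success"),
    ("main", "failed"),
    ("feature/login", "failed"),
    ("feature/login", "success"),
    ("fix/typo", "success"),
    ("fix/typo", "failed"),
]
rates = failure_rate_by_branch_type(workflows)
print(rates)  # {'default': 25.0, 'topic': 50.0}
```

Splitting the rate this way keeps topic-branch failures, where it is safe to fail, from masking how often the default branch actually goes red.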
Some DevOps experts believe that each branch should always be green. But in our view, red builds are fine as long as teams are able to recover quickly. A failing build doesn’t have to signify a problem: It means that your CI system is working and is giving you valid data.
If we look at failure by branch type, we can gain more insights into what failures really mean. For example, it’s no surprise that the failure rate on default branches is lower. Our data found that over 30% of active project branches never failed over the 90-day time period we observed – supporting our belief that topic branches are where the majority of change sets are being committed and validated prior to mainline integration. When you merge to the default branch after extensive pre-work on topic branches, you have a better understanding of the change that your code is going to produce.
Our finding of an 18% change failure rate on default branches is close to the State of DevOps 2019 report’s figure for elite teams, which experience a change failure rate of 0-15%.
As for the finding that 50% of projects never had a failure when there were configuration changes to the circle.yml file, we were surprised. One possible explanation is that configuration is reused. For instance, an organization with many projects that share the same configuration will update one and duplicate it once the changes pass testing. Another explanation is the use of CircleCI orbs: reusable, shareable configuration packages that teams can use to add functionality without fear of failure because they’re tested by the author and validated by the community. Regardless of how this low rate of failure comes about, it’s significant for CircleCI customers, since it counters the widely held notion that configuration changes are hard and require frequent updates.
The incremental value of CI/CD adoption
We know that optimizing for a single DevOps metric or selecting a particular tool doesn’t turn you into a high-performance team. In fact, adopting CI/CD can increase failure rates, at least in the short term, as you grapple with these changes across an organization. Change on this scale isn’t easy! If you’re able to reduce change failure rates, it’s a step in the right direction. And as you improve, you’ll move along the road to producing the best software, consistently, and at high velocity. Simply adopting CI/CD principles puts you on the road to improved performance.