At CircleCI, CI has a second meaning: Continuous Improvement. We continuously seek out feedback not only to improve our code but to improve our processes and get better at our jobs along the way. This Continuous Improvement starts with one important company value: a blameless culture. Our blameless culture extends into every part of how we operate. It allows us to build trust among our different engineering teams, and it’s crucial to how we approach incidents, learn from them, and prevent them from happening again. A blameless culture means we lay the blame on processes, not people.
To help you understand what a blameless culture feels like from the inside, we want you to hear from employees at every level of the organization.
The engineer’s perspective - Tyler McGoffin
Recently, changes my team made to the codebase caused an incident on the CircleCI platform. But I was never worried I would be blamed. The blamelessness that we prioritize empowered me to focus on what was most important — resolving the incident and ensuring it doesn’t happen again.
Before an incident occurs
Blamelessness lets me get my job done without worrying that a faulty change will reflect poorly on me. There’s an understanding that if I introduce a change that causes an incident, it was only a matter of time before that incident would occur anyway. It could have just as easily been another team member who picked up the ticket and introduced the breaking change. This mentality allows me to do my best work without fear of being reprimanded if something goes wrong.
Debugging an incident
When something does go wrong, the top priority is to mitigate the impact on our customers and colleagues. If my breaking change wakes up a teammate in Europe, we’re all focused on getting them back to bed as quickly as possible by solving the issue fast, rather than finding out who caused the problem.
By focusing on solving the problem, rather than placing blame, everyone is motivated to jump in and help debug the issue. No one hesitates to throw out ideas about what may be causing the problem, or dive down rabbit holes to find more context. I’ve never been blamed for a bad idea or received any ill will for chasing down a red herring. Everyone involved is focused on what matters most — getting our systems back up and running.
After an incident
Now comes the fun part. We’ve mitigated the incident and it’s time to figure out exactly what happened. Once the root cause is discovered, we don’t stop asking questions there. It would be easy to lay the blame on the guilty commit, mandate that it gets fixed, and make a note on my career progress that I made this happen. But that would be placing blame on the person. We look deeper at the why.
- Why did this commit break the app? Was there a weakness or limitation that this exposed? Instead of just fixing the commit, do we need to fix the bigger system?
- Why didn’t we catch this with CI? Was there testing or alerting that could have been in place to prevent this?
- Why weren’t these weaknesses discovered before? Was this truly an unknown or did something fall through a previous incident retrospective that could have addressed this?
An incident means that we’ve lost some trust in the system, but that trust can be recovered. An incident retrospective creates actionable items so that I can trust the process to cover me and know that we’ve learned from the situation and improved our processes and systems going forward. Everyone did their best with the knowledge they had at the time.
This is how we focus on process instead of people to address incidents, and it makes me a more confident engineer in every part of my job.
The engineering manager’s perspective — Jace Proctor
As a manager, it’s one thing to articulate an aspirational culture with presentations and documents; what you actually enforce, allow, and encourage is something else entirely.
As an engineering manager at CircleCI, I help enforce our blameless culture. I ensure that we focus on processes and improvements rather than specific individuals or actions. During incident mitigation or a retrospective, we make sure that the tenor and the topic are focused on why and how, not on what or who, even when we’re looking backward to figure out what happened.
One way we do this is with the Five Whys technique for retrospectives. Individual actions and mistakes are almost always the result of deeper foundational problems, but conversation naturally tends to center on superficial mistakes. I help my team by pushing the discussion deeper and identifying root causes using the Five Whys framework.
Here’s a quick example:
- “The database went down and our app failed.” Why?
- “The recent code we committed brought down the database.” Why?
- “It pushed our database beyond capacity with a bad loop.” Why?
- “We didn’t write tests before committing so we didn’t catch it before it hit production.” Why?
- “We were up against a really tight deadline to ship this feature and had to cut corners.” Why?
- “We just don’t have enough people.”
This is a simplified example, but you can see how repeatedly asking “why” and pushing the discussion further helps us get past the superficial results and get to more systemic root causes.
This also forces the discussion beyond individuals or individual actions. It’s not the responsibility of individuals to fix systemic problems that go far beyond the scope of their authority or expertise. We help teams dig deep to arrive at those foundational issues and then we as managers, alongside senior leadership, can take action to solve them for everyone’s benefit.
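For teams that like to record retrospectives in a structured way, the chain from the example above can be sketched in a few lines of Python. This is only an illustration, not a tool CircleCI uses: the `FiveWhys` class and its `ask_why`, `depth`, `root_cause`, and `transcript` members are hypothetical names invented here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Hypothetical record of a retrospective: a problem plus successive answers to "Why?"."""

    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> FiveWhys:
        """Record the answer to the next "Why?" and return self so calls can be chained."""
        self.answers.append(answer)
        return self

    @property
    def depth(self) -> int:
        """Number of times "Why?" has been asked so far."""
        return len(self.answers)

    @property
    def root_cause(self) -> str:
        """The deepest answer reached, or the problem itself if nobody has asked yet."""
        return self.answers[-1] if self.answers else self.problem

    def transcript(self) -> str:
        """Render the chain as plain text, one line per statement."""
        lines = [self.problem]
        lines += [f"Why? {answer}" for answer in self.answers]
        return "\n".join(lines)


# The example from the text, with each answer prompting the next "Why?".
retro = (
    FiveWhys("The database went down and our app failed.")
    .ask_why("The recent code we committed brought down the database.")
    .ask_why("It pushed our database beyond capacity with a bad loop.")
    .ask_why("We didn't write tests before committing.")
    .ask_why("We were up against a really tight deadline and had to cut corners.")
    .ask_why("We just don't have enough people.")
)

print(retro.transcript())
print("Root cause:", retro.root_cause)
```

Note that nothing in the structure names a person. Each entry describes a condition in the system, which is the property the technique is meant to preserve.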
The CTO’s perspective — Rob Zuber
As a leader, your goal is to build a high-performing organization, and high-performing teams can only exist when everyone’s well-being is prioritized. This is supported by data from Google’s Project Aristotle and the book Accelerate (Forsgren et al.), which correlates team performance with psychological safety.
The richest information about how an organization can be improved comes from failures. When people feel safe to talk openly about failures, they expose those areas for improvement and take action to make things better. Without that safety, those issues are never surfaced.
It’s not about incidents
Incidents are the microscope under which we tend to examine blameless culture because they are usually acute, extremely visible, and the highest priority. They are also extremely stressful. There is a clear distillation of behavior on display during and after an incident.
But the culture that shows up in your incident response is going to be a direct reflection of the culture that you build every day, with every action, under the most mundane circumstances. For example, if you blame the recruiting team every time a hire doesn’t go to plan, you won’t be able to walk into a post-incident review, call it a “safe space,” and have your employees be honest about what went wrong. Instead, if your response to the recruiting team about a wayward hire is, “What additional information might have helped you to make a different decision here?” it’s clear that you’re trying to solve the problem rather than blame the person. Your team will notice that.
Examining the system
When we examine an incident to determine areas for improvement, one of the most common questions we ask is, “In what ways did the system fail us?” It can be phrased many different ways, but the point is clear: the same humans operating within a different system would have different outcomes.
This is where we break ties with blame. If the same humans would get different outcomes in a different system, then let’s talk about the system.
As a leader, walking into this moment can get really uncomfortable if you’re not ready for it. Placing blame on an individual is easy and it feels safe. It’s also super lazy. As you start to examine the system, you find that the design of the system — the collection of explicit or implicit values, priorities, and decisions over time that created your current context — was not the work of the individuals involved in the incident. It was the work of organizational leaders, including yourself.
How you choose to deal with that fact is at the center of building a blameless culture. Openly discussing the context and mental model that got you to that system, along with ways it can be improved, sends a very important signal about your own willingness to do the work of improvement.
One of my favorite resources on systems is Thinking in Systems by Donella Meadows, in which she says, “Systems can’t be controlled, but they can be designed and redesigned.”
The system behaved according to its design. Take responsibility for your role in the system design and find a way to modify the design to produce better outcomes… without blame.
The bottom line
Blameless culture allows CircleCI’s engineers to solve problems faster because there is no fear of retribution. When our engineers uncover weaknesses in our systems through error, they learn from those mistakes, which helps us continuously iterate and improve our processes and product.