Reliability update 2023-08-23
Overall, the July numbers reflect continued progress in the right direction. The number of incidents, the number of incidents over 60 minutes, and the overall build impact are all down month-over-month. Ultimately, this outcome is good for you, our customer. While we will continue to focus on the impact of incidents, we’re now expanding our focus in a new phase of this journey.
The numbers that we have been reporting so far are primarily oriented around availability or uptime. Those are important because when we have major service degradations, our customers are impacted. That said, availability numbers aren’t the full picture. Large systems are usually not “up” or “down,” but rather fluctuate in their performance. We are digging deeper into the areas in between incidents to ensure all of our customers are having great experiences and ultimately looking to share that view as part of this series. We’ll share more details on those learnings in next month’s update.
As we continue to improve, we will continue to set the bar higher. Because ultimately, every build matters.
Reliability update 2023-07-14
The Build Impact number for June is higher than we’re aiming for. There were two significant incidents that drove this number. Here is more context:
- A 2023-06-15 incident impacted all customers and was tied to a service that we have been actively working to replace. While we had started the process of moving to a more robust system, we hit a capacity limit on underlying data storage much earlier than expected. Since the incident, we have upgraded infrastructure components to keep up with demand as we execute the migration.
Completing the replacement of the service in question will mark the end of a series of upgrades to our core job execution engine that has already had a significant impact on the stability of that area.
- A 2023-06-29 GitHub incident impacted the majority of our customers. While we rely on 3rd-party systems and are impacted when they are down, we were able to recover as soon as they were back online, which was a result of multiple investments in handling these significant demand surges.
The ongoing work to manage 3rd-party systems and decommission services that don’t meet our needs will continue. But as I’ve referenced here, we’ve made some longer-term investments and they are paying dividends. When discussing reliability, we don’t often talk about the problems we avoided as a result of our investments. Here are a few examples showing how, when we faced potential disruptions later down the line, we were ready and successfully prevented customer impact.
- After an incident we had last summer (June 2022), which was a series of API abuse attacks that had downstream impacts on our platform, we made significant investments across multiple layers of our API serving infrastructure. This return on investment was realized in February 2023, when we suffered similar abuse attempts but at 4x the scale. No engineering intervention was required and no customers were impacted.
- After an incident we had late last year (September 2022), which was a multi-hour disruption due to AWS capacity issues, we implemented a fallback approach for VM allocation that adapts to AWS instance availability. Building on top of earlier investments in more robust VM management, we have avoided similar large impacts and reduced VM wait time during daily fluctuations in AWS capacity.
- Protecting your ability to build is our primary goal and the separation of the critical path has been our focus. A multi-hour outage (September 2022) was triggered by a confluence of factors impacting the Postgres cluster for one of our core services. Immediately afterward, we began investing in separating out historical data to protect active pipelines, starting with that service. We witnessed the same confluence of factors earlier this year, and the issue was resolved in under 10 minutes with no customer impact.
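The fallback approach to VM allocation mentioned in the list above can be illustrated with a minimal sketch. The actual implementation is not described in this update, so everything here is an assumption made for illustration: the `CapacityError` exception, the `launch` callable, and the placeholder instance type names (`type-a`, `type-b`, `type-c`) are hypothetical and do not correspond to a real AWS API.

```python
from typing import Callable, Sequence


class CapacityError(Exception):
    """Hypothetical error meaning the provider has no capacity for a type."""


def allocate_vm(preferred_types: Sequence[str], launch: Callable[[str], str]) -> str:
    """Try each instance type in order; return the first VM id that launches."""
    failures = []
    for instance_type in preferred_types:
        try:
            return launch(instance_type)
        except CapacityError as err:
            # No capacity for this type right now; fall back to the next one.
            failures.append(f"{instance_type}: {err}")
    raise CapacityError("no capacity in any candidate type: " + "; ".join(failures))


def fake_launch(instance_type: str) -> str:
    """Stand-in launcher that pretends only 'type-b' currently has capacity."""
    if instance_type != "type-b":
        raise CapacityError("insufficient capacity")
    return f"vm-on-{instance_type}"


vm_id = allocate_vm(["type-a", "type-b", "type-c"], fake_launch)
print(vm_id)
```

The point of the sketch is that a capacity shortfall for one instance type becomes a retry against the next candidate rather than a failed allocation.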
While these wins are exciting because they make our platform more reliable, we know any disruption that prevents you from building is far from ideal. We’re continuing to stay the course, and with these investments in these key areas, we’re confident we’re on the right path.
Reliability update 2023-05-31
We’re continuing to see the results that we’re aiming for, and the longer incidents impacting your build time are fewer and farther between. Today, we’ll briefly share what we’re doing and how we’re measuring success.
We’re continuing to report on the same numbers. See the previous update below for background.
One metric we are focused on is the number of extended incidents. We’re defining extended incidents as incidents lasting over 60 minutes, and focusing there because when we have short incidents, we can generally get you back to work quickly. We’re always working to do better, but we’re happy with this outcome.
Because incident duration in minutes doesn’t present the whole picture of your experience as a customer, we also measure build impact. By weighting our incident time by how much it impacts the most important thing that you do (building), and by tracking this number, we ensure we’re focused on the outcomes that matter. We’ve seen a solid improvement over the last few months (minus March, which we previously covered). There is more good than bad here, which signals that what we’re learning and doing as a result is working.
Note: In review of our data, we uncovered previously unaccounted-for build impact in March and have adjusted this chart to reflect that.
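To make the idea of weighting incident time concrete, here is a minimal sketch. The exact formula behind the build impact number is not given in this update, so the weighting below (incident minutes multiplied by the fraction of builds affected) is an assumption chosen for illustration, as are the `Incident` type, its field names, and the sample values.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields; the real data model is not described in the update.
    duration_minutes: int
    builds_affected_fraction: float  # 0.0 (no builds hit) to 1.0 (all builds hit)


def build_impact(incidents: list[Incident]) -> float:
    """Sum incident minutes, each weighted by the share of builds it affected."""
    return sum(i.duration_minutes * i.builds_affected_fraction for i in incidents)


# Made-up sample data: one long but narrow incident, one short but broad one.
incidents = [
    Incident(duration_minutes=80, builds_affected_fraction=0.25),
    Incident(duration_minutes=30, builds_affected_fraction=1.0),
]
raw_minutes = sum(i.duration_minutes for i in incidents)
print(raw_minutes, build_impact(incidents))
```

Under this assumed weighting, the shorter incident contributes more to the total than the longer one, because it affected every build. That is the kind of difference raw incident minutes would hide.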
The investments we’ve made are paying off. The work we’ve done to protect the critical path, improve our observability tools, and take a system view has produced a number of good months and strengthened our confidence in the way forward.
In our next update, we’ll share some specific details of the changed behaviors in our system that have resulted from our investments.
Reliability update 2023-04-14
Our focus on reliability continues. We’re making progress and are learning a lot, but there is certainly more work to be done.
How do we know? We’ve been going deep into the data to get a more nuanced understanding of where we can improve. In our last update, we committed to sharing the numbers to show how we think about reliability and what our key results have been.
While any disruption is something we’d like to avoid, we’ve been ruthlessly focused on reducing incident length, specifically those disruptions that run over 60 minutes.
Beyond that, we’re hyper-focused on making sure that, even in the case of a disruption somewhere in our system, you can still build. We’ve been protecting this critical path, and prioritizing work that minimizes customer impacts. We call this measurement “build impact.”
Here’s what the data tells us
The chart below labeled Extended Incidents represents the number of incidents that were over 60 minutes in duration. While we want to see that number near or at “0”, the trend has been downward since last April. During this time, we heavily invested in our ability to respond to incidents by ensuring the entire team had a more thorough understanding of the complexity of our distributed system, so they could make decisions that were informed by this system view.
The second chart (below), labeled Build Impact, shows the impact on builds from the incidents that occurred over the past 12 months, normalized against the start of the period (April 2022). On the graph, April’s level of build impact is measured as “1”, with our goal to be near or at “0.” After April 2022, we had some success by prioritizing our ability to isolate sections of our system to protect the critical path and keep you building.
Unfortunately, March 2023 was a challenging month, caused by an issue with a Kubernetes upgrade. You can read more details here.
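The two charts described above rest on two simple calculations: counting incidents longer than 60 minutes per month, and dividing each month's build impact by the value for the baseline month so that the baseline reads as 1. A minimal sketch of both follows. The function names and all sample figures are made up for illustration and are not the data behind the charts.

```python
from collections import Counter

EXTENDED_THRESHOLD_MINUTES = 60  # "extended" means over 60 minutes


def extended_incidents_per_month(incidents):
    """Count incidents longer than the threshold, keyed by 'YYYY-MM'."""
    counts = Counter()
    for month, duration_minutes in incidents:
        if duration_minutes > EXTENDED_THRESHOLD_MINUTES:
            counts[month] += 1
    return dict(counts)


def normalize_to_baseline(monthly_impact, baseline_month):
    """Divide each month's impact by the baseline month's, so the baseline is 1."""
    baseline = monthly_impact[baseline_month]
    return {month: value / baseline for month, value in monthly_impact.items()}


# Made-up sample data, purely for illustration.
incidents = [("2022-04", 95), ("2022-04", 20), ("2022-05", 61), ("2022-05", 60)]
monthly_impact = {"2022-04": 200.0, "2022-05": 50.0, "2022-06": 0.0}

extended = extended_incidents_per_month(incidents)
normalized = normalize_to_baseline(monthly_impact, "2022-04")
print(extended)
print(normalized)
```

Note that an incident of exactly 60 minutes is not counted here, since the definition is "over 60 minutes."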
Here’s where we’re headed
The goal is to raise our confidence level, continue limiting the blast radius of production infrastructure changes and test as thoroughly as possible. For Kubernetes, and across our infrastructure, we are improving our upgrade strategies to meet our standards for risk, and will have new paths in place before we perform another upgrade to our main production clusters.
What do I want you to take away from this? I’ve been bringing you regular updates because continuously improving our platform’s reliability is a vital part of how we deliver value to you, our customers. This work is also hard, and complex. When speaking to fellow software engineers, I know you know that.
We know that when we focus on something, we can make it better. But I’m here reporting, not celebrating. There’s more to do, and above all I want you to know we’re still dedicated to the cause.
We’ll keep doing the work; thanks for being here.
Reliability update 2023-01-30
I’ve been coming to you with reliability updates for the last four months, sharing with you our wins and challenges, and what we’ve been doing behind the scenes to ensure our platform is ready to build anytime you are.
This month, we experienced a security incident that derailed many of your teams’ abilities to do just that.
So, despite reliability being up over the last month, I want to keep this short, and leave you with two things:
- Our number one priority for this moment is to stay focused on the aftermath of the incident, working with our customers to ensure your pipelines are secure and your secrets are rotated. We’ll be doing this for as long as it takes, but once we return to business as usual, I’ll be back with a more numbers-focused update, bringing you substantive metrics you can take as indicators of the work we’ve been doing.
- As I think about systems, reliability and security actually converge: things that make us more reliable also make us more secure. So doing the work we’ve done to increase reliability (i.e. more cleanly segregate parts of our system, maintain third-party verified audit points and logging) also enhanced our ability to recover from our security incident. Both security and reliability are required for you to do your job, and our goal is to continue to raise the bar on having highly secure, highly available systems.
If you haven’t rotated your secrets yet, please do so. If you have, thank you.
Reliability update 2022-12-21
In the last couple of updates I’ve talked about some of the actions we’ve been taking to build longer-term reliability at the core of our service. In this update, I wanted to describe a bit of how we identified the underlying issues, specifically by orienting around systemic issues, whether they are organizational or architectural.
Like many organizations, at CircleCI we have a “you build it, you run it” approach to software delivery, meaning teams are responsible for the full lifecycle of their own delivery streams. We also have shared practices for incident investigation and follow-up. However, with teams capable of managing this full lifecycle on their own, we started missing more systemic problems, thinking they were only present in the individual teams that were handling the follow-up.
Earlier this year, we turned our attention to those systemic issues by aggregating data from all of our sources. This included looking across post-incident reports, the associated historical data, and all of the follow-up work that had been done by individual teams. Inventorying all the data was important, but what stood out was how much it took to get to the aggregate view we needed to make changes.
Most of our tooling is oriented towards individual incidents rather than that aggregate review. Even the tools that showed data across incidents didn’t expose what we wanted. We looked at everything from where time was being spent in incident response to the classifications of causes so we could organize ourselves and our systems for highest impact.
While we were able to do most of the aggregation in a spreadsheet, much of post-incident follow-up is very narrative driven. Doing the work of structuring that historical data enough to draw conclusions was hard but has been very helpful in seeing the bigger picture. We were able to see things that weren’t clear from reading one incident report or by having a deep dive with one of our teams.
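As a rough illustration of the kind of aggregation described above, the sketch below rolls structured incident rows up by cause across teams. It is not the spreadsheet we used. The row fields (`team`, `cause`, `diagnose_min`, `fix_min`), the team and cause labels, and the numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows, standing in for narrative reports after manual structuring.
reports = [
    {"team": "orchestration", "cause": "capacity", "diagnose_min": 40, "fix_min": 10},
    {"team": "execution", "cause": "capacity", "diagnose_min": 55, "fix_min": 15},
    {"team": "execution", "cause": "bad deploy", "diagnose_min": 5, "fix_min": 20},
]


def aggregate_by_cause(rows):
    """Roll per-incident rows up into per-cause totals across all teams."""
    totals = defaultdict(
        lambda: {"incidents": 0, "diagnose_min": 0, "fix_min": 0, "teams": set()}
    )
    for row in rows:
        bucket = totals[row["cause"]]
        bucket["incidents"] += 1
        bucket["diagnose_min"] += row["diagnose_min"]
        bucket["fix_min"] += row["fix_min"]
        bucket["teams"].add(row["team"])
    return dict(totals)


summary = aggregate_by_cause(reports)
# Causes that show up in more than one team are candidates for systemic fixes.
systemic = sorted(c for c, t in summary.items() if len(t["teams"]) > 1)
print(systemic)
```

Once the data is in this shape, a cause that recurs across several teams, or a response phase that consistently dominates the time spent, becomes visible in a way that no single incident report shows.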
All of this work highlighted an interesting tension between maintaining our fast-moving, stream-aligned DevOps culture and bringing in a vantage point to eliminate system-wide challenges for our teams. This work provided us clearer insights into where we needed to address issues that were found in our organization or our architecture (or both). As a result, we are mitigating problems that are more systemic in nature, and providing guardrails so our teams can still move quickly and own what they build. With this aggregate view, we’re making progress on seeing and addressing points of failure before they occur.
To better building,
Reliability update 2022-10-27
Last month, I made a commitment to you that I’d bring you updates on our reliability work: how it’s going, what’s working, and where we still have work to do.
I’d like to achieve three goals through these updates:
- Reinforce our ongoing commitment to reliability,
- Share more of our reliability roadmap, so you have broader context to any future updates we share on this work, and
- Be transparent about our work and insights, so that the community can also benefit from our experiences on this journey.
In the software community, we all suffer when services go down, but we advance together by sharing our learnings.
Last month (scroll down for the full update), I shared that we’d been working on isolating parts of our platform in order to protect customers’ builds and keep them running.
Today, I want to share more detail about our current approach: system isolation in order to protect your builds no matter what. I want to emphasize that this isn’t simply something we did and reported on; the principle of protecting builds will guide us through both upcoming reliability investments and all new development work on the platform.
Like all platforms, we got here through evolution, so let me walk you through an (abridged) history of the CircleCI platform.
We started out with a monolith, like so many other companies. By default, in a monolith all your work is commingled. This necessarily creates coupling, and that leads to the potential for failures to cascade. With no separation, a failure somewhere in your codebase can lead to a failure anywhere else.
As we broke apart the monolith, we did so based on work stages, or what was happening at different points (such as workflow orchestration and job execution). That approach simplified our codebases and made delivery easier, but within those stages, we have a combination of active work and historical reporting on that work.
The work we’re doing now is to isolate at each of these stages, such that every component that is involved in running active builds can be protected from anything else.
We’re doing this work incrementally to ensure rapid results while minimizing disruptions. The first stage involved simple tools, like functions to disable historical viewing if needed. This creates a release valve.
We have also increased our use of read-only replicas for historical queries. And we are leveraging split deployments of some of these services to isolate compute resources even when the code is shared.
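The first-stage tools described above (a switch to disable historical viewing, and read-only replicas for historical queries) can be sketched in a few lines. This is a simplified illustration under assumed names, not our production code: `Flags`, `HistoryDisabled`, `choose_datastore`, and the datastore labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    # Hypothetical kill switch; turning it off sheds historical load.
    history_enabled: bool = True


class HistoryDisabled(Exception):
    """Raised when historical viewing has been switched off."""


def choose_datastore(query_kind: str, flags: Flags) -> str:
    """Return which (hypothetical) datastore should serve a query."""
    if query_kind == "active":
        # Active builds always go to the primary, regardless of any flag.
        return "primary"
    if query_kind == "history":
        if not flags.history_enabled:
            # The release valve: refuse history rather than risk active builds.
            raise HistoryDisabled("historical viewing is temporarily off")
        return "read-replica"
    raise ValueError(f"unknown query kind: {query_kind}")


flags = Flags()
print(choose_datastore("active", flags), choose_datastore("history", flags))
```

The design choice worth noting is the asymmetry: the switch can only ever turn off historical reads, so no flag state can block the path that active builds depend on.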
The next stage that we are moving into involves separating systems completely. In other words, code paths and data for real-time builds vs code paths and data for history. While replicas are helpful in distributing load, they require that all stores have the same volume of data. This can be solved with sharding, but even then you are stuck with a schema design that is trying to support both access patterns. When they are fully separate, we can optimize each design, both for scale and for product capabilities. We’re early in that approach but we’re again taking incremental steps to start realizing gains as quickly as possible.
This brings us to a good question: why didn’t we do this at the outset? Well, this is a need of scale, and doing it right out of the gate would be a mistake. Why? When you first create a platform like ours, and make early architecture decisions, I believe it’s incredibly important to make decisions that enable you to pivot quickly, and respond to the demands of your early customers. You don’t know what they’ll want, and therefore you don’t know what features your team will go on to build in order to support and delight those customers. It was possible at the beginning to imagine we’d face a project like this eventually, but I don’t think it’s possible to know which of an infinite set of scale tipping points we’d reach.
In all, what I’d like you to take from this update is that we are taking this seriously, and approaching it the way we approach all our work: incrementally, and with your needs at the core of our decisions. While we don’t want systems to break, it happens. Better isolation means we can march toward the real goal: ensuring your builds run, every day, no matter what else may be going on in our platform or the larger ecosystem.
If you’re curious about our incremental steps on this journey, check back here for updates. And if not, that’s fine too; get back to building the things that support and delight your customers, but we wanted you to know what is going on behind the scenes at CircleCI.
To better building,
Reliability update 2022-09-19
Last week, the pipelines page was unavailable for a significant portion of a day. This prevented many teams from managing their work as expected. As an engineer and as a leader, I know how important it is to stay in flow, and have your tools there when you need them. We’re sorry for the disruption caused to your team’s work.
As I stated back in April (full post follows), my top priority as CTO is reducing the length and impact of incidents at CircleCI.
But when things look like they did last week, the headway we’ve made may not be apparent.
In addition to focusing on diagnostic speed since our original post, we’ve begun investing in protecting your ability to get work done (namely, run pipelines), even when things break. While our work is still in progress, we’ve made some key gains. But if you can’t see or feel the impact of this work so far, then we’re not succeeding. Not as a technical team, and not at creating the trusting relationship we want to build with you.
One gain worth noting is the work the team has done to begin segregating parts of our architecture. This lets us constrain incident impact and protect your pipelines when things otherwise go sideways. It allowed us to do things like temporarily shut off bits of the UI in order to make sure that pipelines could still run, which is what we did last week. But we didn’t share that with you. Instead, you saw that the site was down, and reasonably assumed that nothing had changed.
Again, we have more work to do here, and we remain deeply invested in it. We can’t stop things from breaking, but we can continue to find new ways to ensure your builds can always run, and give you better and more timely information about how to accomplish your work, even when Plan A fails.
Additionally, it’s been 5 months since our last reliability update, and we can do better. Going forward, I’m committing to updating you on our new reliability developments. I welcome your feedback in the meantime.
Reliability update 2022-04-13
At CircleCI, our mission is to manage change so software teams can innovate faster. But lately, we know that our reliability hasn’t met our customers’ expectations. As the heart of our customers’ delivery pipelines, we know that when we go down, your ability to ship grinds to a halt as well. We’re sorry for the disruptions to your work and apologize for the inconvenience to you and your team.
What’s been happening
No single part of our platform or infrastructure is at fault for recent outages. Instead, we’ve seen a mix of sources of issues, from bug-causing updates to dependency issues and upstream provider instability. The January update to our pricing plan brought increased traffic and usage to our platform. While we planned and modeled for this, it has contributed to us reaching inflection points in some of our systems.
While there is no clear pattern in the cause of recent incidents, we know our overall time to resolution has been too long. Diving into our incident response protocol has helped us uncover places where our team execution under pressure has not helped us. We fully embrace blameless engineering culture and the DevOps principle of “you build it, you run it,” but the distributed nature of both our system and our teams has made that connection, communication, and resolution difficult.
Why? Over the past 12 months, we’ve nearly doubled our engineering team. This growth has been intentional and provided some incredible velocity - last week alone we deployed over 850 times. But that growth also means our base of intuitive knowledge has become less central and cohesive. We need to rebuild both broad and deep systems understanding across all of our teams.
What we’re doing to move forward
For us, technology is all about people, and improving our reliability will take a people-first approach. As of last week, we’ve created a tiger team of first responders, including myself, on an on-call rotation. This is a global team of individuals empowered to both fix things quickly and effect long-term change through both process and technology. Our goal is to strengthen the impact of engineers who can drive an incident from identification to resolution and then help share insights with the larger team.
Historically, we’ve focused our reliability efforts on system “hot spots” that were known sources of downtime, including fleet management and machine provisioning. We’ve made deep investments there that have paid off. But as our organization has grown, our issues have been less about service-level disruptions, and more about the complex interactions of a large distributed system. Our goal for this tiger team is to get you back to working as quickly as possible, then use what we learn to resolve the underlying causes of those incidents.
We’re also making investments to our platform to build and rebuild with the future in mind. We recently hired a new chief architect to lead our efforts in platform scalability and building for long-term product innovation.
How you will know we’re making progress
While it would be unwise (and unbelievable!) to promise that we will never have another incident, we can commit to making them less of a burden for our customers.
As we continue to invest in our long-term platform stability, our short-term focus is on reducing incident length. For incidents where customer impact exceeds one hour, we commit to publishing an incident report on status.circleci.com.
As CTO, improving incident response is my top priority. We know we have work to do here, and we’re confident that the plans and team we have in place will help us make immediate improvements. Thank you to our customers and community for your ongoing support and patience.