At CircleCI, our mission is to manage change so software teams can innovate faster. But lately, we know that our reliability hasn’t met our customers’ expectations. As the heart of our customers’ delivery pipelines, we know that when we go down, your ability to ship grinds to a halt as well. We’re sorry for the disruptions to your work and apologize for the inconvenience to you and your team.
What’s been happening
No single part of our platform or infrastructure is at fault for recent outages. Instead, we’ve seen a mix of sources of issues, from bug-causing updates to dependency issues, and upstream provider instability. The January update to our pricing plan brought increased traffic and usage to our platform. While we planned and modeled for this, it has contributed to us reaching inflection points in some of our systems.
While there is no clear pattern in the cause of recent incidents, we know our overall time to resolution has been too long. Diving into our incident response protocol has helped us uncover places where our team execution under pressure has not helped us. We fully embrace blameless engineering culture and the DevOps principle of “you build it, you run it,” but the distributed nature of both our system and our teams has made that connection, communication, and resolution difficult.
Why? Over the past 12 months, we’ve nearly doubled our engineering team. This growth has been intentional and provided some incredible velocity - last week alone we deployed over 850 times. But that growth also means our base of intuitive knowledge has become less central and cohesive. We need to rebuild both broad and deep systems understanding across all of our teams.
What we’re doing to move forward
For us, technology is all about people, and improving our reliability will take a people-first approach. As of last week, we’ve created a tiger team of on-call first responders, including myself, on on-call rotation. This is a global team of individuals empowered to both fix things quickly and effect long-term change through both process and technology. Our goal is to strengthen the impact of engineers who can drive an incident from identification to resolution and then help share insights with the larger team.
Historically, we’ve focused our reliability efforts on system “hot spots” that were known sources of downtime, including fleet management and machine provisioning. We’ve made deep investments there that have paid off. But as our organization has grown, our issues have been less about service-level disruptions, and more about the complex interactions of a large distributed system. Our goal for this tiger team is to get you back to working as quickly as possible, then use what we learn to resolve the underlying causes of those incidents.
We’re also making investments to our platform to build and rebuild with the future in mind. We recently hired a new chief architect to lead our efforts in platform scalability and building for long-term product innovation.
How you will know we’re making progress
While it would be unwise (and unbelievable!) to promise that we will never have another incident, we can commit to making them less of a burden for our customers.
As we continue to invest in our long-term platform stability, our short-term focus is on reducing incident length. For incidents where customer impact exceeds one hour, we commit to publishing an incident report on status.circleci.com.
As CTO, improving incident response is my top priority. We know we have work to do here, and we’re confident that the plans and team we have in place will help us make immediate improvements. Thank you to our customers and community for your ongoing support and patience.