On July 19, CircleCI faced a sitewide outage which left thousands of teams unable to test and deploy builds for the better part of a day. This outage affected the productivity of many development teams, and surely caused a few missed deadlines. We value the trust our customers place in us and are deeply sorry for the effect this had on their work. For those interested, we want to give more details as to what happened, why it happened, and what we’re doing about it.
On 2018-07-19 at 16:46:14 UTC, the root certificate for one of CircleCI’s internal Certificate Authorities (CAs) unexpectedly expired, causing all certificates issued by that CA to be simultaneously invalidated. The CA in question was used for all instances of Mongo running in the CircleCI environment (a total of 17 hosts). Using an internal CA requires that the root certificate be distributed to every client in order to establish a trust relationship, so the impact reached well beyond the Mongo servers themselves: hundreds of hosts and containers were affected. In order to avoid rewriting the certificate stores of every client host, we generated 3rd-party certs for all of our Mongo servers and reconfigured the servers to operate with these certs.
How did we get here?
During the summer of 2015, before CircleCI had hired its first SRE team member, our 3rd-party Mongo hosting provider was consistently failing us. We started to build out an internally-managed set of Mongo servers that we could scale to meet our needs and manage with tooling that we felt was more appropriate for the service we were working to offer. While we already had our sights set on moving off of Mongo, we also knew that we’d likely be using it for a long time to come. In early August 2015, while building out the first of what would eventually become 5 independent replica sets, we created an internal Certificate Authority to start issuing certificates for our replica sets. Its root certificate was generated with a 10-year validity period in order to avoid some of the challenges of rotating internal certs on critical infrastructure. This decision was made at a time when tooling to easily rotate certificates was not as readily available as it is today.
In late September 2015, as we were getting ready to start migrating traffic to our new Mongo clusters, a reviewer identified that the root certificate used to sign our server certificates was using SHA1 for the signature algorithm. This certificate was immediately replaced with a new one created to use SHA512 as the signature algorithm. The intent was that this certificate would also have a 10-year validity period. However, as we discovered recently, this was not the case. While we don’t have access to the machine, and therefore cannot see the original command issued to create this certificate, it is clear that operator error led to the root certificate being generated with a shorter validity period.
All of the server-specific certificates that we generated and signed with this root CA were created with 10-year validity periods, and those were the ones that we inspected to ensure that we were not in any hurry to replace them.
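The gap described above comes down to a simple rule: a certificate is only usable while every certificate above it in the chain is also valid, so the expiry that matters is the earliest one in the chain. The following is a minimal sketch of that rule, not the check we actually ran. The `CertInfo` record, the certificate names, and the server certificate date are hypothetical; only the root expiry timestamp comes from this incident.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Sequence


class CertInfo(NamedTuple):
    # Hypothetical record; in practice these fields would be read
    # from each certificate in the chain.
    name: str
    not_after: datetime


def effective_expiry(chain: Sequence[CertInfo]) -> CertInfo:
    """Return the certificate in the chain that expires first."""
    if not chain:
        raise ValueError("chain must contain at least one certificate")
    return min(chain, key=lambda cert: cert.not_after)


# Illustrative data: the server date is a made-up stand-in for
# "about 10 years out"; the root date is the expiry from this incident.
chain = [
    CertInfo("mongo-server", datetime(2025, 9, 25, tzinfo=timezone.utc)),
    CertInfo("internal-root-ca",
             datetime(2018, 7, 19, 16, 46, 14, tzinfo=timezone.utc)),
]

first = effective_expiry(chain)
print(first.name, first.not_after.isoformat())
```

Inspecting only the server certificate would report the 2025 date. Walking the whole chain surfaces the root as the certificate that expires first.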
What are we changing to avoid this happening again?
The majority of the work needed to prevent this type of incident is work that we have already done. In the time that has passed since our Mongo infrastructure was deployed, we have gone from 0 to 9 SRE team members, investing heavily in our operational infrastructure. We now use automated tooling to manage our internal certs, as well as multiple 3rd parties to handle our overall TLS certificate load. We also monitor certificate expiry on all the systems that have been subsequently built.
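For readers who want a sense of what an expiry check can look like, here is a minimal sketch using only the Python standard library. It is not our monitoring tooling. The function names and the 30-day warning window are assumptions made for illustration, and the timestamp format is the one Python's `ssl` module uses for a certificate's `notAfter` field.

```python
import ssl
from datetime import datetime, timedelta, timezone


def time_remaining(not_after: str, now: datetime) -> timedelta:
    """Time left before a certificate's notAfter timestamp.

    `not_after` uses the text format reported by the ssl module,
    for example "Jul 19 16:46:14 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires - now


def needs_attention(
    not_after: str,
    now: datetime,
    warn_window: timedelta = timedelta(days=30),  # assumed threshold
) -> bool:
    """True if the certificate has expired or expires within warn_window."""
    return time_remaining(not_after, now) <= warn_window


# The root certificate expiry from this incident, checked from a
# hypothetical "now" a few weeks earlier.
root_not_after = "Jul 19 16:46:14 2018 GMT"
now = datetime(2018, 7, 1, tzinfo=timezone.utc)
print(needs_attention(root_not_after, now))
```

A check like this only helps if it is run against every certificate in the trust chain, including root and intermediate CA certificates, and not just the server certificates.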
Unfortunately, we allowed the migration of our certificate management on our Mongo infrastructure to be deprioritized based on the understanding that we had a rather large window to handle a migration of tooling. We are currently auditing all of our uses of TLS across our entire infrastructure to identify any configurations that are not using the latest standards of our tooling. We are also looking at any other non-standard configurations on older system infrastructure in order to get all of our systems up to our current level of monitoring and maintenance tooling. Specifically relating to the Mongo servers in this incident, we replaced the certificates from the internal CA with externally provided certificates with a 1-year validity period and will be actively working to migrate them to our common infrastructure now that we are clear of the incident.
What else did we learn?
We have an engineering team that places a lot of value on peer code reviews and automation to catch errors before they get to production. This is one of the reasons we lean so much on infrastructure as code. But even at a time when some of these tools were not as well developed and we hadn’t gotten some of them into place, it still would have been possible to get peer review in a manual fashion. We do this quite frequently when taking actions that carry risk. A simple “I’m about to do this thing with this command” to get an “LGTM” from a second set of eyes is always worth it.