Testing · Jul 15, 2021 · 7 min read

Safely changing critical systems without downtime

Conor McDermottroe

Senior Staff Software Engineer


Confidently testing in production is crucial for engineers to deliver software quickly. In this blog, we’ll discuss what our team at CircleCI has learned while changing our critical systems.

The challenges that came with permission checks

Whenever CircleCI is used for anything, we check to ensure that action is authorized. Sometimes we need to do multiple checks for a single action.

For example, if you use our API to start a new workflow we need to check both that you’re allowed to start workflows on that project and that you’re authorized to use any of the contexts used by that workflow. This adds up to a lot of permission checks. We perform thousands of these checks every second and we need to get the right answer as fast as possible every time.
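
To make the fan-out concrete, here is a minimal sketch of how a single "start workflow" request could translate into several checks. All of the names and data shapes below are illustrative assumptions, not CircleCI's actual code.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: names and data shapes are assumptions,
# not CircleCI's real API.

@dataclass
class Workflow:
    project: str
    contexts: List[str] = field(default_factory=list)

def has_permission(user: str, action: str, resource: str) -> bool:
    # Stand-in for a real permission check against the backing store.
    return True

def can_start_workflow(user: str, workflow: Workflow) -> bool:
    # Starting a workflow requires a project-level check plus one check
    # per context the workflow uses.
    if not has_permission(user, "start-workflow", workflow.project):
        return False
    return all(
        has_permission(user, "use-context", ctx)
        for ctx in workflow.contexts
    )
```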

The original implementation of these permission checks was based purely on GitHub’s access model. If we needed to check if you could start a new build, we would check if you could push to the corresponding GitHub repository.

Over time, this grew to include layers of caching, retries, Bitbucket support, and many other additional complexities, including:

  • The implementation logic was split over two subsystems so we had to know which one to use depending on the type of permission check we were doing. This was the result of a previous attempt at a refactor that was abandoned partway through due to competing priorities.
  • CircleCI used to be a monolithic service and one of the implementations was still inside that codebase. This made it very difficult for newer services to perform authorization checks.
  • Since many of the permission checks rely on calling Bitbucket or GitHub APIs, we were unusually vulnerable to performance or availability issues with those APIs. Having two independent implementations didn’t help here either, because their behavior during outages was inconsistent.
  • Some checks relied on a “user profile,” which was an expensive-to-build cache of everything we knew about a user from both their GitHub and Bitbucket accounts. This hurt our long-tail performance: if you were building lots of repos from many different Bitbucket/GitHub organizations, you either got an instant result or a timeout waiting for the profile to build.
  • We wanted to allow users to explicitly grant any permission, rather than being tied to their Bitbucket/GitHub access, but only one of the subsystems supported doing that.

Performing a migration without negatively impacting customer experience

We decided to complete the previously attempted refactor by moving all checks from the monolith into the other subsystem. At the same time, we cleaned up that subsystem to remove uses of the “user profile” and improve reliability and performance. The migration had to happen without any negative impact on the customer experience, so we did it slowly and carefully over the course of approximately a year.

Step one: add logging and gather metrics

Our first step was to create a façade over the two permissions implementations and force all checks to go through it. This gave us a place to add logging and gather metrics so that we could measure the baseline performance and availability of each permission check. More importantly, it allowed us to switch between implementations of each permission check without needing to change the calling code.
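
As a rough illustration of the façade idea, the sketch below routes each check to one of two duck-typed implementations and times every call. The class name, routing rule, and log format are assumptions for the sake of the example, not our actual code.

```python
import logging
import time

logger = logging.getLogger("permissions.facade")

class PermissionsFacade:
    def __init__(self, legacy_impl, new_impl, legacy_actions):
        # Both implementations are expected to expose check(user, action, resource).
        self.legacy_impl = legacy_impl
        self.new_impl = new_impl
        self.legacy_actions = set(legacy_actions)

    def check(self, user, action, resource):
        # Every caller goes through this single entry point, so each check
        # can be timed and logged, and the implementation behind it can be
        # swapped without touching the calling code.
        impl, name = self._select(action)
        start = time.monotonic()
        try:
            return impl.check(user, action, resource)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            logger.info("permissions.check impl=%s action=%s elapsed_ms=%.1f",
                        name, action, elapsed_ms)

    def _select(self, action):
        # Baseline routing: preserve whatever split the callers used to
        # hard-code, so behavior is unchanged while we gather data.
        if action in self.legacy_actions:
            return self.legacy_impl, "legacy"
        return self.new_impl, "new"
```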

Step two: migrate one permission check at a time

The next step was to migrate each permission check into its final destination. We were extremely cautious about these migrations because we had to meet some stringent criteria:

  • The new implementation must return the same result as the old implementation. We could not afford to accidentally grant someone access to something they should be prohibited from using.
  • The new implementation must be at least as fast as the old one. Not only must its average latency be the same or lower, but its long tail latency must also stay within acceptable limits. We compared mean, median, 95th percentile, and 99th percentile timings to confirm that performance was acceptable (a sketch of this kind of comparison follows this list).
  • The new implementation must have similar or better availability than the old implementation. The behavior during Bitbucket and GitHub outages must be predictable and the impact of those outages must be minimized whenever possible.
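
For illustration, here is a sketch of the kind of latency comparison described above, given recorded timings for both implementations. The 5% tolerance and the helper names are assumptions, not our actual tooling.

```python
import statistics

def percentile(samples, pct):
    # Nearest-rank style percentile over a list of timings in milliseconds.
    ordered = sorted(samples)
    index = round(pct / 100 * (len(ordered) - 1))
    return ordered[index]

def latency_summary(samples):
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }

def new_impl_is_acceptable(old_ms, new_ms, tolerance=1.05):
    old, new = latency_summary(old_ms), latency_summary(new_ms)
    # The new implementation must be no worse than ~5% slower on every
    # statistic we track, including the long tail.
    return all(new[key] <= old[key] * tolerance for key in old)
```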

How to perform an individual permission migration

For each individual permission migration (from implementation A to implementation B) we followed this process:

  1. Status quo ante. We called A for all checks.
  2. Toe in the water. We called A for all checks, but in a background thread, we also called B for 1% of the checks (a sketch of this background comparison follows the list). We compared the results and recorded whether B produced the same answer. We also recorded performance data about both A and B.
  3. Slowly turn it up in the background. We continued to serve all the answers from A, but we ratcheted up the percentage of checks done via B in the background. At this point in time, we were looking for obvious bugs in the implementation and also ensuring that B could safely take the load of 100% of the requests.
  4. Find and fix the remaining bugs in B. While running B at 100% in the background, we checked any disagreements and fixed the bugs that caused them. This is where we were able to iterate quickly, because we were iterating on B, which was still in the background and invisible to customers. Our feedback was almost instant after deployment, and yet there was almost no risk of breaking things if we made a mistake, so we could afford to move fast until we reached 100% agreement between A and B.
  5. Slowly deploy it for real. We stopped running B in the background and started serving real answers from it. Over a period of time, we sent X% of traffic to A and Y% to B, slowly reducing X and increasing Y. We knew the implementations were now equivalent, but we wanted to avoid abruptly causing customer-visible changes in cases where performance differed.
  6. Leave it to soak. We let B run at 100% for a while so that we could validate its performance and behavior over a full load cycle. This allowed us to build more confidence that we hadn’t missed any extremely rare bugs.
  7. Remove the old implementation. Clean up by removing A.
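
Here is a minimal sketch of steps two through four: every answer is still served from A, while B runs in the background for a sampled fraction of checks and any disagreements are logged. The function names and the toggle lookup are hypothetical stand-ins, not CircleCI's actual code.

```python
import logging
import random
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("permissions.migration")
executor = ThreadPoolExecutor(max_workers=4)

def check_a(user, action, resource):
    return True  # stand-in for the existing implementation

def check_b(user, action, resource):
    return True  # stand-in for the new implementation being validated

def shadow_percentage():
    return 1  # stand-in for reading a feature toggle, e.g. 1% at first

def checked_permission(user, action, resource):
    # The customer-visible answer always comes from A at this stage.
    result_a = check_a(user, action, resource)

    # For a sampled fraction of checks, also run B off the request path.
    if random.uniform(0, 100) < shadow_percentage():
        executor.submit(_compare_in_background, user, action, resource, result_a)

    return result_a

def _compare_in_background(user, action, resource, result_a):
    try:
        result_b = check_b(user, action, resource)
        if result_a != result_b:
            # Disagreements are the bugs to chase before serving B for real.
            logger.warning("A/B disagreement: action=%s resource=%s", action, resource)
    except Exception:
        # A failure in B must never affect the customer-visible result.
        logger.exception("background check against B failed")
```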

The choice of which implementation to use, and what percentage to run in the background, was driven by feature toggles that could be changed instantly. Even as late as step six above, we could revert to our original implementation with a single command.
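
Step five can then be driven by a second toggle, sketched below using the same hypothetical names as the previous example. Setting the toggle back to zero reverts to A instantly, with no rollback or redeploy.

```python
import random

def serve_b_percentage():
    return 0  # stand-in for reading the "serve from B" feature toggle

def serve_permission(user, action, resource):
    # Route a toggle-controlled fraction of real answers to B.
    if random.uniform(0, 100) < serve_b_percentage():
        return check_b(user, action, resource)  # the new implementation
    return check_a(user, action, resource)      # the old implementation
```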

Step three: clean up security permissions

The final step was to clean up the code used for the switchover. The measurement and logging code could stay, because it remained valuable for monitoring the behavior of our permission checks over time, but we removed the code used for switching between implementations. This simplification is important because it allows new engineers to become familiar with the code more quickly.

Lessons learned from testing in production

Testing in production has provided our team with many insights that have helped us deliver better software, faster:

  1. Moving slowly and carefully can improve overall development velocity. When engineers have confidence in their ability to make changes without customer-facing consequences they can iterate quickly.
  2. Launching something new in the background, invisible to customers, is a powerful tool for testing in production. We can build our confidence with good design and pre-production testing but there are always unknown unknowns and those can only be discovered in production. If you can do that without customers noticing, you can deal with them before they cause any meaningful impact.
  3. Feature toggles are the best way to revert bad changes. Services take time to roll back to a previous version and reverting a change in source control and redeploying is even slower (and it makes your git history harder to follow!).
  4. Good telemetry is essential. We were able to use billions of data points to build confidence in our changes and produce very stable timing distributions.

To learn more on this topic, download our ebook, A rational guide to testing in production.
