Cloud provider outages are a common and unavoidable issue for us at CircleCI. One of the most common causes of boot failures for our customers is availability zone degradation, which can happen for any number of reasons, including network errors, operations failures, and zones hitting their capacity limit.
For customers using our machine executor or remote Docker features, we use our cloud providers to spin up a brand new virtual machine (VM) for each executor in the build, and then throw it away when the build is done. Because these VMs take a couple of minutes to boot, we maintain a pool of pre-booted VMs so that builds can start with little to no wait time. This is convenient when our cloud providers are firing on all cylinders, but less so when they run into failures that then affect the wait time for our customers.
We do the best we can to minimize the impact of zone failures on our customers and engineers, but no amount of metrics, alerts, logging, and documentation can ever solve the problem completely. Cloud providers don’t typically consider it a breach of their respective SLAs if a zone goes down, but we seek to go above and beyond that level of accountability when offering compute to our customers. So, in an effort to improve our fault tolerance in the face of these inevitable issues, we decided to come up with an in-house solution.
It’s 11:23 a.m. and I’m in the zone, chasing down a bug that’s been bothering us for over 2 hours. All of a sudden, my phone goes off. It’s PagerDuty. “Task wait time excessive for resource type:medium image:ubuntu-1604:201903-01”. I sigh, tap “acknowledge,” and read further into the alert message.
I open the dashboards to match the symptoms against the runbook. Then, I assess customer impact: this outage looks severe. I head to our #incident Slack channel to update the team. I work to mitigate the issue and monitor the recovery. Eventually, wait times drop and the incident is resolved. Great, now what was I working on again?
Unfortunately, this type of interruption is inevitable when you rely on cloud providers. Each one not only increases wait time for our customers and takes up engineering resources to fix, but also causes us to fall behind demand for VMs, leaving customers’ builds stuck in a queue for even longer.
In our early attempt to solve this problem, we established a threshold for customer wait time. Once wait time rose above that threshold, a Datadog alert would trigger and an engineer would get paged. From there, the manual process of diagnosing and fixing the issue was effective but inefficient. The engineer needed to go into the runbook, figure out which zone was degraded, configure changes, and then deploy them. After removing the unhealthy zone, they would also need to manually boot VMs to catch up on the backlog of builds. Finally, once the backlog was cleared, an engineer would have to remember to re-enable the zone after it had recovered.
Even though our system was tolerant of some failures (like a machine that failed to boot or an API that wasn’t cooperating), most of the alerts that we received required an engineer to diagnose the issue. Despite our efforts to trigger alarms and pages earlier, isolate failures by data center and zone, and even track the lifecycle of the provider machines to help us predict outages ahead of time, we still had to manually reroute traffic away from the degraded zone once we knew there was an issue. This left customers waiting in a queue, and engineers scrambling to get ahead of the issue.
Instead of reacting to zone failures and slow build times, we decided to find a way to proactively mitigate the impact on our customers, even though we couldn’t control the health of our cloud provider’s zones. We wanted a solution that would automatically detect when a zone was unhealthy, stop spinning up VMs in that particular zone, and automatically add it back into our rotation as soon as the zone was healthy again.
We landed on the circuit breaker design pattern as the best solution for this problem. In this pattern, a circuit breaker controls access to an external resource, in this case a provider availability zone. When the success rate of interactions with the provider zone falls below a certain threshold, the circuit breaker ‘opens’ and access to the resource is cut off, rerouting traffic to another zone. After some time passes, a small number of interactions are permitted to determine whether the success rate has returned to normal. If so, the circuit breaker ‘closes’, allowing access to the resource once again. Otherwise, the circuit breaker goes back to its ‘open’ state and repeats the test until the zone is healthy again.
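To make the state transitions concrete, here is a minimal single-process sketch of the pattern in Python. It is not our production code: the class name, the parameter names and defaults, and the sliding-window approach to measuring success rate are all assumptions chosen for illustration. The intermediate trial state is labeled `half_open`, which is what the pattern's literature commonly calls it.

```python
import time
from collections import deque


class CircuitBreaker:
    """Illustrative breaker guarding one provider zone.

    All names and default values here are hypothetical.
    """

    def __init__(self, min_success_rate=0.8, window=20, cooldown=60.0,
                 probes=3, clock=time.monotonic):
        self.min_success_rate = min_success_rate
        self.results = deque(maxlen=window)  # recent outcomes, True = success
        self.cooldown = cooldown             # seconds to stay open before probing
        self.probes = probes                 # trial requests allowed when half-open
        self.clock = clock                   # injectable so tests can control time
        self.state = "closed"
        self.opened_at = None
        self.probe_results = []
        self.probes_issued = 0

    def allow(self):
        """Return True if a request may be sent to the zone right now."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: start letting a few trial requests through.
            self.state = "half_open"
            self.probe_results = []
            self.probes_issued = 0
        if self.state == "closed":
            return True
        if self.state == "half_open" and self.probes_issued < self.probes:
            self.probes_issued += 1
            return True
        return False

    def record(self, success):
        """Record the outcome of a request that was previously allowed."""
        if self.state == "half_open":
            self.probe_results.append(success)
            if len(self.probe_results) >= self.probes:
                rate = sum(self.probe_results) / len(self.probe_results)
                if rate >= self.min_success_rate:
                    self._close()
                else:
                    self._open()
            return
        if self.state == "open":
            return  # late result from before the breaker opened; ignore it
        self.results.append(success)
        # Only judge the zone once the window is full, to avoid tripping on noise.
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate < self.min_success_rate:
                self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()

    def _close(self):
        self.state = "closed"
        self.results.clear()
```

A caller would check `allow()` before asking the zone for a VM, fall back to another zone when it returns `False`, and call `record()` once the outcome of the boot is known.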
Typically, circuit breakers are implemented in memory, but that wasn’t a viable option for us. In our system, after making the API call to boot up a VM in a provider zone, we put a message on a queue so that another worker (running in a different process) could wait for the VM to boot, SSH in, and prepare it to run a customer build. This meant that the process that needed to know which zones were healthy was different from the one that knew whether VMs were successfully booting in those zones, making an in-memory circuit breaker unsuitable for our problem.
We realized that we needed a centralized circuit breaker implementation instead. We stood up our circuit breaker service both to track zone success rates and to manage the state of each circuit breaker. This made it possible to aggregate provider-zone success rates across all nodes in our VM provisioning service cluster. By centralizing our circuit breakers in the service, we are also able to manually set the circuit breaker state in one place, if necessary.
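The split of responsibilities can be sketched as follows. This is a simplified, hypothetical stand-in rather than our actual service: a single in-process Python object plays the role of the shared service (in practice it would sit behind a network API), and every name in it is invented for illustration. It also omits the timed recovery probing described earlier so that the focus stays on aggregation and manual overrides.

```python
import threading
from collections import defaultdict, deque


class ZoneBreakerService:
    """Hypothetical shared service that aggregates results per zone."""

    def __init__(self, min_success_rate=0.8, window=50):
        self._lock = threading.Lock()
        self._min_success_rate = min_success_rate
        # zone name -> sliding window of recent boot outcomes (True = success)
        self._results = defaultdict(lambda: deque(maxlen=window))
        # zone name -> "open" or "closed", set manually by an operator
        self._overrides = {}

    def report(self, zone, success):
        """Called by the workers that observe whether a VM actually booted."""
        with self._lock:
            self._results[zone].append(success)

    def set_override(self, zone, state):
        """Force a zone's breaker 'open' or 'closed'; pass None to clear."""
        with self._lock:
            if state is None:
                self._overrides.pop(zone, None)
            else:
                self._overrides[zone] = state

    def is_open(self, zone):
        """Return True if the zone should currently be avoided."""
        with self._lock:
            if zone in self._overrides:
                return self._overrides[zone] == "open"
            results = self._results[zone]
            if not results:
                return False  # no data yet: assume the zone is healthy
            return sum(results) / len(results) < self._min_success_rate

    def usable_zones(self, zones):
        """Called by the workers that decide where to boot the next VM."""
        return [z for z in zones if not self.is_open(z)]
```

In this sketch, the process that prepares VMs calls `report()`, the process that requests VMs calls `usable_zones()`, and an operator can call `set_override()` to pin a zone's state from one place.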
The automated circuit breaker system has made noticeable improvements to customer build time, has freed up time for our engineers to work on other important projects, and has ultimately saved us resources by reducing the need to set up pre-booted VMs in anticipation of zone outages. Much to the relief of the engineering team, the new system has done away with the need for pages due to this issue and has improved the overall response time to zone outages by about 20 minutes per incident.
The circuit breaker has worked so smoothly since implementation that when a zone became degraded during the week of our Windows launch, we didn’t even notice. The circuit breaker automatically rerouted traffic as needed, reducing any excessive wait time that customers would otherwise have experienced and freeing up the team’s time to focus on the launch. The circuit breaker has not only improved performance for our customers, but also acts as a fail-safe, giving the team added confidence that wait times will not increase due to a lack of VMs in a degraded zone.