When CircleCI customers want to kick off a new build, pre-warmed, pre-provisioned VMs are ready to serve, with all dependencies already installed, 24 hours a day, 7 days a week. “It’s invisible to customers,” says Cayenne Geis, a software engineer on CircleCI’s Machine team. “It’s the sort of thing [our customers] just expect to work.”
But how do we manage an always-ready fleet of VMs when demand fluctuates constantly throughout the day and week? And how do we shift resources quickly when, for example, there’s an Xcode image update and we need to re-allocate image versions to match usage?
The answer to those questions has changed over the last year, in a way that has improved the experience of running builds on CircleCI for customers, even if it might not be immediately noticeable to those using the service. Previously, we manually adjusted scale levels for each combination of resource and base image. We maintained a fixed number of machines, but our customers’ needs fluctuated greatly. When demand spiked, our team was called into action to adjust the number and type of machines available. When demand dipped (on weekends, for example), we were paying for VMs that sat unused.
The only thing worse than machines sitting idle was the times that our engineers had to work furiously to keep up with shifting demand. “The pool size was flat, but the rate at which tasks were arriving varied,” explains John Swanson, Senior Fullstack Engineer on the Machine team. “And any sign of queuing led to paging engineers.” While it was important for us to keep our availability high and queue time low, we also realized we needed a way to reduce the manual toil on our engineering team, freeing their capacity for higher-value work. How could we build a system that could scale up and down in response to customer need?
“We had assumed it was going to be some really complicated formula,” Geis continues. “We all went away and started researching. [One of our colleagues], Marc, came back and said, ‘Why don’t we try this simple formula?’ …And it actually worked really well.”
Using classic queueing theory and data gathered from our VM service, the team was able to plug sample data into different queueing models in a spreadsheet. We were curious to see what the different models would generate: how many VMs would each one say we needed to pre-boot?
As it turned out, the Queuing Rule of Thumb did an incredible job of predicting the values we already had in the manual configuration of our VM scaler. That gave us the confidence to try applying it incrementally. Starting in August of this year, we did just that.
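The post doesn’t publish the exact formula we settled on, but the flavor of such rules can be sketched with the classic square-root staffing heuristic from queueing theory: provision for the average offered load, plus a square-root safety margin to absorb bursts. The function name, the example numbers, and the `beta` safety factor below are all illustrative assumptions, not CircleCI’s actual configuration.

```python
import math

def servers_needed(arrival_rate: float, service_time: float, beta: float = 1.0) -> int:
    """Square-root staffing sketch.

    arrival_rate: tasks arriving per minute
    service_time: average minutes a task occupies one VM
    beta: safety factor trading idle capacity against queueing
    """
    # Offered load = average number of VMs busy at any instant.
    offered_load = arrival_rate * service_time
    # Staff for the load plus a square-root cushion for variability.
    return math.ceil(offered_load + beta * math.sqrt(offered_load))

# e.g. 12 builds/minute, each holding a VM for ~5 minutes:
# offered load = 60, so keep roughly 68 VMs warm with beta = 1.
print(servers_needed(12, 5))  # 68
```

The appeal of a rule like this is exactly what the team found in the spreadsheet exercise: it needs only two measurable inputs (arrival rate and service time) and a single tuning knob, rather than a “really complicated formula.”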
Very quickly, we saw queue times at peak drop from as high as 3 minutes to under a minute, as well as a decrease in idle machines at non-peak times. As we gradually transitioned our system from manual scaling to auto-scaling, peak wait times went down by as much as 60% and more closely matched wait times at non-peak hours.
At the same time, introducing auto-scaling significantly reduced the effort required to maintain each individual resource class on our platform, from one hour per month per resource to effectively zero. This freed up our engineers’ time to focus on other work.
Customers started reporting more consistent job start times, regardless of the time of day or week. But even more important than that, automated scaling allowed us to greatly expand the catalog of resource types available to our customers.
“Previously, allocating the right number of machines required constant updating for every image-to-resource pair,” explains Alexey Klochay, a Product Manager on the Machine team. “As we expanded the number of [machine types and resource classes] we offered, the load on the team was increasing. So we invested in team efficiency, and this led to more options for more varied customer use cases.” As a concrete example, our new autoscaler allowed us to ship our latest resource classes, including Windows, on a much shorter timeline.