In the last post in this series, I introduced a computational simulation of build clusters and used it to determine optimal build fleet sizes taking both machine costs and developer waiting costs into account. The ultimate goal is to learn how best to manage an auto scaling build fleet, but I started with the case of a fixed-size fleet to get a baseline to improve on. Today the auto scaling begins!
Reviewing fixed-size fleets
The cost of developers waiting for queued builds caused the cost model I introduced last time to always prefer near-zero queueing. This was accomplished by making the build fleets large enough to absorb any spikes in traffic, sometimes by a surprising amount. For example, five-minute builds triggered at an average rate of fifty per hour, or an average load of 250 build-minutes per hour, could ideally be handled by fewer than five machines (250 / 60 ≈ 4.2) if the builds were timed perfectly. The model recommended twelve machines for this traffic pattern, though. Here is a plot of the simulated usage of such a cluster.
The queueing associated with these spikes “breaking through” would quickly undo the savings that could be gleaned from a smaller fleet by wasting expensive developer time. The twelve-machine fleet is needed to handle the random spikes in traffic, but this results in a lot of machines sitting idle.
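To make the gap between average load and peak load concrete, here is a minimal sketch. It is not the simulation from this series; it assumes builds arrive as a Poisson process and that every build starts immediately (no queueing), and the function names are made up for illustration.

```python
import random


def offered_load(builds_per_hour, build_minutes):
    """Average number of machines kept busy: build-minutes per minute."""
    return builds_per_hour * build_minutes / 60.0


def peak_concurrency(builds_per_hour, build_minutes, hours, seed=0):
    """Largest number of builds running at once over the simulated window.

    Assumption: Poisson arrivals (exponential gaps between builds) and
    unlimited machines, so no build ever waits in a queue.
    """
    rng = random.Random(seed)
    rate_per_minute = builds_per_hour / 60.0
    horizon = hours * 60.0
    events = []  # (time, +1 for a build starting, -1 for a build finishing)
    t = 0.0
    while True:
        t += rng.expovariate(rate_per_minute)
        if t >= horizon:
            break
        events.append((t, 1))
        events.append((t + build_minutes, -1))
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    print("average load:", round(offered_load(50, 5), 2), "machines")
    print("peak over one week:", peak_concurrency(50, 5, hours=24 * 7))
```

The average load comes out just above four machines, while the peak over a simulated week is noticeably higher. That difference is the capacity a fixed-size fleet has to keep idle most of the time.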
Let’s auto scale!
Ideally, instead of just leaving all this extra capacity lying around in case a spike in usage happens, we can use auto scaling to resize our fleet dynamically. The way auto scaling works (at least as implemented with AWS Auto Scaling groups and CloudWatch Alarms) is that there is some metric that you want to hold within an acceptable range (in this case we’ll use the number of available builder machines), and whenever the metric strays outside of that range, the auto scaling mechanism responds by adding machines to the fleet or removing them from it as needed.
While the promise of a continually resized fleet is exciting, it introduces a lot more parameters to adjust compared to a fixed-size fleet. Instead of just a number of machines, we can now vary the thresholds (high and low) that trigger scaling, the frequency with which the metric is sampled (the CloudWatch Alarm period in AWS), the number of periods in a row the metric needs to be past the threshold for scaling to happen, and how many machines to add or remove at once.
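The knobs listed above can be written down as a small data structure plus a decision rule. This is only a sketch of threshold-based scaling in general, not the AWS implementation or the code behind this post; `ScalingPolicy` and `scaling_decision` are hypothetical names, and the rule assumes `periods_required` is at least 1.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    low_threshold: int     # scale up when available machines fall below this
    high_threshold: int    # scale down when available machines rise above this
    period_seconds: int    # how often the metric is sampled (not used by the rule itself)
    periods_required: int  # consecutive breaching samples needed before acting
    step: int              # machines added or removed per scaling action


def scaling_decision(policy, samples):
    """Return the change in fleet size for a list of samples, oldest first.

    Each sample is the number of available (idle) builder machines
    observed during one period.
    """
    recent = samples[-policy.periods_required:]
    if len(recent) < policy.periods_required:
        return 0  # not enough history yet
    if all(s < policy.low_threshold for s in recent):
        return policy.step
    if all(s > policy.high_threshold for s in recent):
        return -policy.step
    return 0


if __name__ == "__main__":
    policy = ScalingPolicy(low_threshold=2, high_threshold=6,
                           period_seconds=60, periods_required=2, step=3)
    print(scaling_decision(policy, [5, 1, 1]))  # two low samples in a row
    print(scaling_decision(policy, [1, 3]))     # only one low sample
```

Every field of the policy is a dimension of the search space, which is why the number of combinations grows so quickly.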
The volume of builds and their durations will continue to influence the performance of the cluster, but now we also need to think about the time it takes to add builder machines to the fleet–basically the boot time–as this determines the time lag associated with accommodating sudden changes in traffic.
To the cloud!
All of these new settings result in a huge, many-dimensional parameter space over which we need to minimize our cost model. To map out the space, I started with a handful of traffic patterns that I had already modeled with fixed-size fleets and used those fleet sizes as upper bounds for the scaling thresholds (because they already did a good job of keeping queues near zero). Then I varied boot times and the rest of the auto scaling parameters over reasonable ranges. This resulted in a few hundred thousand combinations of settings and several thousand years of simulated time to run. After some cursory profiling and performance enhancements, this turned out to be about a 12-hour job for my laptop.
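Enumerating a space like this is a Cartesian product over the parameter ranges. The sketch below shows the idea with `itertools.product`; the axis names and values are placeholders chosen for illustration, not the ranges actually used for this post.

```python
import itertools


def parameter_grid(axes):
    """Yield one dict per combination of values, skipping inverted thresholds."""
    names = list(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        combo = dict(zip(names, values))
        # A policy only makes sense if the low threshold is below the high one.
        if combo["low_threshold"] < combo["high_threshold"]:
            yield combo


# Hypothetical ranges, for illustration only.
AXES = {
    "boot_minutes": [1, 2, 5, 10],
    "low_threshold": [1, 2, 4],
    "high_threshold": [4, 8, 12],
    "period_seconds": [60, 300],
    "periods_required": [1, 2, 3],
    "step": [1, 2, 4],
}

if __name__ == "__main__":
    print(len(list(parameter_grid(AXES))), "combinations")
```

Even these small ranges produce several hundred combinations for a single traffic pattern, and each added axis multiplies the total.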
This is where I could have spent a lot of valuable time looking for some fancy data science to streamline the search, but instead I decided to follow my own advice from the last post regarding the cost of human time vs. computer time and delegated to several dollars’ worth of c4.8xlarges. I just had to write a bit of code to serialize the inputs and outputs of my model into hunks of JSON that were readily rsyncable with EC2 machines, and I was off to the races.
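For the curious, splitting work into rsync-friendly JSON can be as simple as the following sketch. The file naming scheme and the helper names are assumptions made for illustration; the actual code used for this post may look quite different.

```python
import json
from pathlib import Path


def write_chunks(combinations, n_chunks, directory):
    """Split parameter combinations round-robin into n_chunks JSON input files."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_chunks):
        path = directory / f"inputs-{i:03d}.json"  # hypothetical naming scheme
        path.write_text(json.dumps(combinations[i::n_chunks], indent=2))
        paths.append(path)
    return paths


def read_results(directory):
    """Concatenate every results-*.json file copied back into a directory."""
    results = []
    for path in sorted(Path(directory).glob("results-*.json")):
        results.extend(json.loads(path.read_text()))
    return results
```

Each machine gets one input file, runs the simulations, and writes a matching results file that can be copied back and merged.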
After I analyzed reams of cloud-enabled data, the prospect of saving much money on a build cluster with realistic parameters looked grim. Let’s first look at one of the more predictable results. If you want to be able to auto scale based on quickly changing metrics, fast boot times are key.
The savings we get from auto scaling plummet as the boot time of builder machines increases. If you look back at the spiky build traffic graph at the top of the post, this should make sense. If we launch machines in response to a spike in traffic, and they aren’t ready until after the spike has passed, then they won’t do any good.
Next let’s look at the best auto scaling outcomes relative to the length of each build.
Sadly we only get compelling savings for very slow, infrequent builds. Keep in mind that this model is idealized in a lot of ways, so predicted savings of 10-15% are likely not to materialize in the real world.
We can look at the same general picture by looking at savings vs. build frequency while holding build length constant.
Again, we only see compelling savings for slow, infrequent builds. Looking at the fleet capacity and usage over time can help us understand why auto scaling only seems to help in this scenario.
The builds are slow enough that the fleet is able to scale up to handle new traffic over the course of a single build, and builds are infrequent enough that it’s unlikely for a bunch of builds to be kicked off in the middle of scaling.
The timing just isn’t right…
Unfortunately for our auto scaling project, most organizations don’t trigger only two builds per hour, with builds that last forty minutes each. In fact either of those numbers is probably indicative of an unhealthy software team.
An organization running CircleCI Enterprise is more likely to have dozens or hundreds of builds per hour, meaning an average of no more than a minute or two between builds. Each build should also take only a few minutes. The fundamental problem is that boot times on the order of minutes are just too slow to scale in response to the minute-to-minute random traffic fluctuations of most build fleets. Even a boot time of one minute would be too slow under heavier traffic, especially when combined with the added delays of the AWS CloudWatch periods and Auto Scaling groups.
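A back-of-the-envelope calculation shows why the timing matters. The sketch below assumes Poisson arrivals and a deliberately simplified lag model: the alarm needs a number of consecutive sampling periods to fire, and then the new machine needs its boot time. It ignores any extra delay inside the Auto Scaling group itself, and the function names are made up for illustration.

```python
import math


def reaction_lag_minutes(period_seconds, periods_required, boot_minutes):
    """Simplified lag between a traffic change and new capacity being ready."""
    return period_seconds * periods_required / 60.0 + boot_minutes


def expected_arrivals(builds_per_hour, lag_minutes):
    """Mean number of new builds triggered while the fleet is still reacting."""
    return builds_per_hour * lag_minutes / 60.0


def prob_any_arrival(builds_per_hour, lag_minutes):
    """Chance that at least one build arrives during the lag (Poisson arrivals)."""
    return 1.0 - math.exp(-expected_arrivals(builds_per_hour, lag_minutes))


if __name__ == "__main__":
    lag = reaction_lag_minutes(period_seconds=60, periods_required=2, boot_minutes=3)
    for rate in (2, 50):
        print(rate, "builds/hour:", round(expected_arrivals(rate, lag), 2),
              "expected arrivals during a", lag, "minute lag")
```

With a five-minute lag, a team triggering two builds per hour expects about 0.17 new builds before the extra capacity is ready, while a team triggering fifty per hour expects about 4.2. By the time the new machines boot, the traffic they were launched for has already changed.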
If we were talking about a cluster of containers that could boot in seconds instead of VMs that boot in minutes, that would change things dramatically. That could be a game changer in the coming years, but at the moment most organizations don’t have such a container cluster. There is also the issue that to be able to scale, a container cluster would have to have some idle or low-priority capacity to scale “into”. A pool of containers running offline ETL jobs or other pausable work would be a great candidate to kill off in favor of builds (the canonical example from Google is search indexing jobs), but again, most organizations aren’t there yet. (Also, note that CircleCI Enterprise and circleci.com both use containers to maximize resource usage within the build fleet, but the compute power still needs to come from some underlying pool of VMs.)
There is another critical factor that I’ve ignored so far that the astute reader is probably already thinking about. Everything in the model so far has made the major simplification that the timing between builds is random but occurs at some constant average rate. Of course this doesn’t actually describe real build clusters, because real programmers don’t push code at 4am, or at lunch time, or on New Year’s Day (unless they’re total 1337 10Xer ninjas). These kinds of hourly or daily traffic fluctuations are much more manageable with auto scaling even if booting builder machines takes two, five, or even ten minutes, and we’ll learn all about it… next time!