Over the past 5 years, CircleCI has run millions of builds on our rapidly growing fleet of build containers. We’ve tweaked and tuned a lot of parameters to get our huge, auto scaling cluster of servers to always have capacity to run builds while minimizing unused resources.
A new kid on the block
Lately though, after the launch of CircleCI Enterprise, we’ve had lots of customers asking us how to run their own builds in AWS Auto Scaling groups or similar services. Does auto scaling make sense for them? What parameters will keep their CircleCI instance responsive while minimizing idle server time? The problem is, these customers don’t want a several-thousand-container cluster to process thousands of builds per hour from a globally distributed pool of developers. They want a few hundred containers to process dozens of builds per hour from a couple buildings’ worth of developers. They also use different machine types, different container specs, and have different traffic patterns from circleci.com. We’re experts at running a single giant-sized cluster, but now we need to learn to be consultants for the runners of dozens of heterogeneous, mediumish-sized clusters. Thinking about this problem dusted some cobwebs off one of the volumes in my mental math bookshelf, and it occurred to me that there might be a way to model the problem with a bit of queueing theory. The simple case of a single server running jobs of a fixed duration occurring randomly in time (following a Poisson distribution) is known as an M/D/1 queue, and there are simple closed-form solutions to the expected wait time in such a queue. The multi-server (M/D/c) case has much more complicated closed-form solutions, but adding auto scaling quickly makes the problem intractable. Or at least I didn’t find an answer after several minutes of spirited googling, which is practically the same.
An experimental approach
This is where I could’ve built some kind of elaborate computational simulation of auto scaling build clusters and run millions of iterations of Monte Carlo experiments with hundreds of thousands of possible parameters, so of course that’s what I did. It lives at bellkev/asg-sim, and I’ll be using data from that model in the rest of this post. I should also point out here that while I’m focusing on modeling build clusters as used in CircleCI Enterprise, much of what I cover will also apply to selecting a plan size on circleci.com or even to managing clusters of servers running arbitrary jobs.
Before tackling auto scaling, I spent a bit of time on the fixed-fleet-size case, because that is the basis of comparison for all auto scaling solutions. After verifying that my model matched the theoretical predictions for the simple M/D/1 case, I started on the problem of balancing queue time and resource utilization. It’s obviously easy to do a good job of just one of these things. You can underprovision and keep your queue full and machines busy, or you can overprovision and keep your queue empty and your machines idle. The following graph makes this clear.
This graph shows what happens when the same traffic pattern is run against fixed-size clusters of varying sizes. Clearly smaller clusters have higher machine utilization with long queue times and larger clusters have low utilization with short queue times.
A cost function
In order to choose some kind of balance point, we need to do a bit of hand-wavy math. Developers’ time is valuable, so there should be some kind of dollar value that can be placed on time they spend waiting for builds in the queue. Big servers are also expensive, and there is a cost associated with keeping them running. You can model the total cost something like this:
cost = idle machine hours × cost per machine hour + queued build hours × cost per developer hour
You may be philosophically opposed to the idea of equating human hours to machine hours, or you might argue that five minutes waiting on a build really causes more like thirty minutes of lost productivity because of the interruption it causes. You might even think that queueing time doesn’t really matter, because developers can go get coffee or work on some other feature or something while their builds run. I would argue that this is absurd, because if developers have to take an arbitrarily long break or context switch after each push to CI, then what kind of unmeasurably bad outcomes are you incentivizing? We only need to get the cost parameters accurate to within a factor of ten or so, so don’t worry about it too much. I’ll be using $200 (maybe a little high, but interruptions are expensive) per hour of developer time and AWS m4.large on-demand rates for machine hours unless otherwise stated.
A final point worth making before we proceed is that slow builds are still exactly as bad no matter what. Using the parameters above, the cost of actually running a build looks like this:
build running cost = build run time × (cost per machine hour + cost per developer hour)
Obviously this is terrible, because it takes both compute resources and developer time, but this number only depends on the build run time, number of builds, and other cost parameters. You should absolutely work hard to minimize this cost, but it’s a separate problem from optimizing away the waste cost of under/overprovisioning your build cluster.
Running the numbers
Okay, now that we’ve defined a reasonable cost function for our model, we can look at the fleet size experiment from before in terms of a cost curve. (The dollar amounts are basically arbitrary. They’re big because I ran the simulation for the equivalent of months with hundreds of servers for each data point to make the graph nice and smooth.)
The curve is obviously not symmetric, but that makes sense if you think about it. In queueing theory, a queue that gets added to faster than it can it can process jobs is considered “unstable” and quickly grows to infinity. Even stable queues will occasionally grow a bit due to random spikes in traffic, leading to some average queue length, but that average queue length blows up asymptotically as the system approaches the unstable limit. On the other side of the curve, there is a simple linear growth in cost as we add builder machines to the fleet.
If we produce one of these curves for various traffic patterns, we can come up with optimum fleet sizes for those workloads by taking the minima. Here’s what we get for 5-minute builds getting triggered at various frequencies.
Now that we have optimum fleet sizes in terms of cost, let’s see what queue times and levels of machine utilization are considered optimum for a fixed-size fleet.
Look at that! The queue times are virtually zero, even if it requires keeping machines at under 20% utilization for smaller fleet sizes. Remember, this is even in the model where we treat developers as simple, swappable, fixed-price cogs in the machine. It still clearly pays off to keep queue times low. Also notice that you get some economy of scale as the traffic and fleet size grow.
The picture is similar if we look at varying build times at fixed number of builds per hour.
Only this time we get increased utilization for slower builds. This is a bit of a silver lining around the dark cloud of slow builds, but obviously not something to strive for.
Is this for real?
There are certainly valid possible objections to the conclusion so far. For example, what about a team that runs builds on expensive hardware, lives in an economy where developers are a bit cheaper, and doesn’t consider waiting on builds to be “that bad”? Even if my assumptions are orders of magnitude off, each build requires the equivalent of two m4.10xlarge machines at on-demand prices, and developers earn starvation wages, the result is nearly the same:
Even with super expensive hardware, letting builds queue for more than 10 seconds isn’t cost effective unless developers make about ten dollars per hour. Letting builds queue for more than a minute is equivalent to valuing your developers’ time at less than a dollar per hour, and letting builds queue for more than ten minutes only makes sense if developers make pennies per hour.
“Letting builds queue for more than a minute is like valuing your developers’ time at less than a dollar per hour.”
My theory as to why such queueing is still commonplace is that most organizations don’t look at the cost of their build infrastructure side-by-side with the cost of their developers. Instead, build machines are likely an expense that some team lead is incentivized to reduce, while salaries are more opaque and decided at a higher level. The productivity loss due to queuing builds is hard to measure or quantify, so it gets swept under the rug as one more excuse of lazy, complaining developers.
Fortunately the results of the experiment so far can be distilled down to some very simple advice. If you need to decide how big to make your fixed-size build cluster (or circleci.com plan, which is logically equivalent), set it at whatever size leads to zero queueing. When I say zero, that means less than a second on average for most workloads and under about 10 seconds on average even if you run your builds on expensive hardware (or high parallelism on circleci.com). You will enjoy greater machine utilization as traffic (or build run time) increases, but you shouldn’t worry about that. Always target near-zero queue time.
I should also highlight that this simulation does assume that the timing between builds is random, and not influenced by the queue time itself. In a team with chronically queuing builds, it’s probably common for developers to avoid pushing or skip CI when the queue is long. But again, is that behavior that you want to happening in your team? Probably not!
Hold on though, what about auto scaling? Isn’t it supposed to save the day and let us achieve web scale? Well, my fingers are tired of blogging right now, so I’ll have to leave all the exciting data on that for the next installment. Stay tuned!