Simulating Auto Scaling Build Clusters Part 1: The Mathematical Justification for Not Letting Your Builds Queue
Over the past 5 years, CircleCI has run millions of builds on our rapidly growing fleet of build containers. We’ve tweaked and tuned a lot of parameters to get our huge, auto scaling cluster of servers to always have capacity to run builds while minimizing unused resources.
A new kid on the block
Lately, though, after the launch of CircleCI Enterprise, we’ve had lots of customers asking us how to run their own builds in AWS Auto Scaling groups or similar services. Does auto scaling make sense for them? What parameters will keep their CircleCI instance responsive while minimizing idle server time? The problem is, these customers don’t want a several-thousand-container cluster to process thousands of builds per hour from a globally distributed pool of developers. They want a few hundred containers to process dozens of builds per hour from a couple of buildings’ worth of developers. They also use different machine types and container specs, and their traffic patterns differ from those of circleci.com. We’re experts at running a single giant-sized cluster, but now we need to learn to be consultants for the runners of dozens of heterogeneous, mediumish-sized clusters.