It’s been a year since we launched CircleCI Enterprise, our solution for customers who need to install CircleCI behind their firewall. Today, customers like Nextdoor, Coinbase, Sony, and dozens of others trust CircleCI Enterprise to increase the throughput of their software teams in a secure, customized environment.
So far in this series, I’ve introduced a computational model of auto scaling build fleets and used it to try auto scaling against randomly varying loads of builds. The ultimate goal is to figure out how best to use auto scaling on CircleCI Enterprise installations, which operate under very different traffic patterns from circleci.com. We hit a bit of a snag in the last post when we realized that VM boot times were just too slow to allow scaling in response to random fluctuations in real-world traffic. This time, we’re going to take into account fluctuations in traffic throughout the day and see if auto scaling can help us there.
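To make "fluctuations in traffic throughout the day" concrete, here is a minimal sketch of the kind of daily load curve an Enterprise installation might see: quiet overnight, peaking in the early afternoon when one office's worth of developers is active. The shape, the numbers, and every function name here are illustrative assumptions, not data from our simulations.

```python
import math

def builds_per_hour(hour: float, base: float = 2.0, peak: float = 20.0) -> float:
    """Hypothetical single-office traffic: near `base` overnight,
    rising to `peak` builds/hour around 14:00."""
    # Cosine bump centered on 2 PM, clipped to zero so nights stay at the base rate
    bump = max(0.0, math.cos((hour - 14.0) * math.pi / 12.0))
    return base + (peak - base) * bump ** 2

def desired_fleet(hour: float, capacity_per_machine: float = 2.0,
                  headroom: float = 1.25, boot_delay_hours: float = 0.25) -> int:
    """Provision for the load expected *after* new machines finish booting,
    since boot time was the snag with reacting to random fluctuations."""
    expected = builds_per_hour(hour + boot_delay_hours)
    return math.ceil(expected * headroom / capacity_per_machine)
```

A schedule-driven policy like this sidesteps slow VM boots by scaling on the predictable daily shape rather than on instantaneous queue depth.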
In the last post in this series, I introduced a computational simulation of build clusters and used it to determine optimal build fleet sizes taking both machine costs and developer waiting costs into account. The ultimate goal is to learn how best to manage an auto scaling build fleet, but I started with the case of a fixed-size fleet to get a baseline to improve on. Today the auto scaling begins!
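The last post's fixed-fleet baseline boils down to a trade-off: more machines cost more per hour, but fewer machines make developers wait. As a rough stand-in for the full simulation, an M/M/c queueing model (Erlang C) captures the same trade-off analytically. All the rates and dollar figures below are illustrative assumptions, not numbers from the actual posts.

```python
from math import ceil, factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving build has to queue (Erlang C formula)."""
    if offered_load >= servers:
        return 1.0  # unstable queue: every build eventually waits
    top = (offered_load ** servers / factorial(servers)) * (servers / (servers - offered_load))
    bottom = sum(offered_load ** k / factorial(k) for k in range(servers)) + top
    return top / bottom

def hourly_cost(servers: int, arrival_rate: float, service_rate: float,
                machine_cost: float, wait_cost: float) -> float:
    """Machine-hours cost plus expected developer-waiting cost per hour."""
    load = arrival_rate / service_rate
    if load >= servers:
        return float("inf")
    # Mean queueing delay per build (hours), from the Erlang C waiting formula
    wait = erlang_c(servers, load) / (servers * service_rate - arrival_rate)
    return servers * machine_cost + arrival_rate * wait * wait_cost

# Illustrative numbers: 12 builds/hour, 20-minute builds (service rate 3/hour),
# $1/hour per machine, developer waiting valued at $60/hour.
best = min(range(5, 30), key=lambda n: hourly_cost(n, 12.0, 3.0, 1.0, 60.0))
```

With these made-up costs the optimum lands a few machines above the bare minimum of 4 needed for stability, which is the same qualitative answer the simulation gives: idle headroom is cheaper than queued developers.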
Simulating Auto Scaling Build Clusters Part 1: The Mathematical Justification for Not Letting Your Builds Queue
Over the past 5 years, CircleCI has run millions of builds on our rapidly growing fleet of build containers. We’ve tweaked and tuned a lot of parameters to get our huge, auto scaling cluster of servers to always have capacity to run builds while minimizing unused resources.
A new kid on the block
Lately though, after the launch of CircleCI Enterprise, we’ve had lots of customers asking us how to run their own builds in AWS Auto Scaling groups or similar services. Does auto scaling make sense for them? What parameters will keep their CircleCI instance responsive while minimizing idle server time? The problem is, these customers don’t want a several-thousand-container cluster to process thousands of builds per hour from a globally distributed pool of developers. They want a few hundred containers to process dozens of builds per hour from a couple of buildings’ worth of developers. They also use different machine types, different container specs, and have different traffic patterns from circleci.com. We’re experts at running a single giant-sized cluster, but now we need to learn to be consultants for the operators of dozens of heterogeneous, mediumish-sized clusters.

Amazon’s Auto Scaling groups (ASGs) are, in theory, a great way to scale. The idea is that you give them a desired capacity and the knowledge of how to launch more machines, and they fully automate spinning your fleet up and down as the desired capacity changes. Unfortunately, in practice, there are a couple of key reasons we can’t use them to manage our circleci.com fleet, one of the most important being that the default ASG termination policy kills instances too quickly. Since our instances are running builds for our customers, we can’t simply kill them instantly: we must wait for all builds to finish before we can terminate an instance.
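The waiting logic itself is simple to sketch. Below is a minimal, hypothetical drain-before-terminate loop: stop routing new builds to an instance, poll until its in-flight builds finish, then invoke the cloud provider's terminate call. None of these names come from a real CircleCI or AWS API (AWS's own mechanism for this is ASG lifecycle hooks); this is just the shape of the idea.

```python
import time

class BuildInstance:
    """Minimal stand-in for one fleet member running customer builds."""
    def __init__(self, name: str, running_builds: int = 0):
        self.name = name
        self.running_builds = running_builds

def drain_and_terminate(instance, terminate, poll_seconds=1.0, timeout_seconds=3600.0):
    """Wait for in-flight builds to finish, then call `terminate(instance)`.

    `terminate` is a callback standing in for the provider API call.
    Returns True if the instance was terminated, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while instance.running_builds > 0 and time.monotonic() < deadline:
        time.sleep(poll_seconds)  # in reality: re-check the build scheduler
    if instance.running_builds == 0:
        terminate(instance)
        return True
    return False  # timed out; leave the instance for an operator to inspect
```

In a real deployment this polling would live in a lifecycle-hook handler or a small daemon on the instance, so the ASG only completes termination once the drain reports success.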
Today, we are excited to announce not one, but two new offerings!
We spent the better part of 2015 building a better mobile platform. We know software teams don’t build iOS apps in a vacuum; mobile is generally part of a broader stack, and teams want to build all of their software on one platform. After acquiring Distiller to help realize our mobile vision, and more than a year in public beta and limited release, CircleCI for OS X is generally available today.