So far in this series, I’ve introduced a computational model of auto scaling build fleets and used it to try auto scaling against randomly varying loads of builds. The ultimate goal is to figure out how best to use auto scaling on CircleCI Enterprise installations, which operate under very different traffic patterns from circleci.com. We hit a bit of a snag in the last post when we realized that VM boot times were just too slow to allow scaling in response to random fluctuations in real-world traffic. This time, we’re going to take into account fluctuations in traffic throughout the day and see if auto scaling can help us there.
Fire up the Big Data Hadoop Cloud
Okay, more like fire up the Postgres DB and Python scripts… Anyway, let’s look at how often some actual teams using CircleCI build throughout the day. Looking at any given day for a single company would be pretty random and spiky-looking, but totaling up all the builds in given time windows over a number of days yields some nice smooth curves. Remember, these curves would be flat lines if builds happened randomly at a constant rate throughout the day, but clearly that’s not the case.
The plots of builds run over time mostly look like mittens, with a little dip at lunchtime. The exception is Organization 4, which seems to either not eat lunch or perhaps be more globally distributed. If you add up all the builds across all orgs on CircleCI, you get a plot that looks like this:
It’s sort of a sine wave with a couple of lunch times and a few employees located in Europe. It’s reasonable to assume that a very large company with teams all around the world sharing a central build cluster would start to look something like this.
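For anyone who wants to produce the same kind of curve from their own build history, here is a minimal sketch of the windowed totals described above. It is not the actual query I ran against Postgres; the function name `builds_per_window`, the 60-minute window, and the sample timestamps are all made up for illustration.

```python
from collections import Counter
from datetime import datetime


def builds_per_window(start_times, window_minutes=60):
    """Total builds per time-of-day window, summed across all days.

    Assumes window_minutes divides evenly into the 1440 minutes of a day.
    """
    counts = Counter()
    for ts in start_times:
        minute_of_day = ts.hour * 60 + ts.minute
        counts[minute_of_day // window_minutes] += 1
    n_windows = 24 * 60 // window_minutes
    # A Counter returns 0 for windows in which no build started.
    return [counts[i] for i in range(n_windows)]


# Hypothetical build start times spread over two different days.
sample_starts = [
    datetime(2016, 3, 1, 9, 5),
    datetime(2016, 3, 1, 9, 20),
    datetime(2016, 3, 2, 9, 10),
    datetime(2016, 3, 2, 14, 45),
]
histogram = builds_per_window(sample_starts, window_minutes=60)
```

Because the date is discarded and only the time of day is kept, builds from different days pile up in the same window, which is what smooths out the day-to-day randomness.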
Stationary arrival patterns are so out
To model this time-varying traffic, we’re just going to make one change to the simulation. We’ll keep all the random spikiness from before, because the auto scaling process still definitely needs to handle it, but before sampling from our random distribution we’ll multiply its traffic rate by a time-varying function.
To keep things simple, I’m going to use a sine function that varies from 0 at midnight, up to 1 at noon, and back down to 0 at midnight. I’ll multiply the same traffic patterns from last time by this function, so what used to be the constant traffic rates will now be the peak traffic rates.
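A rough sketch of that change is below. It assumes builds arrive as a Poisson process sampled once per simulated minute, which is a stand-in for illustration rather than a copy of the simulation code. The names `daily_multiplier`, `poisson_sample`, and `simulate_day` are hypothetical, and the half sine wave is just one function that fits the 0, 1, 0 description.

```python
import math
import random

MINUTES_PER_DAY = 24 * 60


def daily_multiplier(minute_of_day):
    """Half sine wave: 0 at midnight, 1 at noon, back to 0 at the next midnight."""
    return math.sin(math.pi * minute_of_day / MINUTES_PER_DAY)


def poisson_sample(rate, rng):
    """Draw a Poisson-distributed count (Knuth's method, fine for small rates)."""
    if rate <= 0:
        return 0
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_day(peak_builds_per_minute, seed=0):
    """Builds arriving in each minute of one day, scaled by the time of day."""
    rng = random.Random(seed)
    return [
        poisson_sample(peak_builds_per_minute * daily_multiplier(minute), rng)
        for minute in range(MINUTES_PER_DAY)
    ]


# Assumed peak rate of 2 builds per minute, purely for illustration.
arrivals = simulate_day(peak_builds_per_minute=2.0, seed=42)
```

The random sampling is untouched, so the minute-to-minute spikes are still there; only the rate fed into the sampler changes with the time of day.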
The big reveal
The nuts and bolts of testing this new traffic pattern are basically identical to last time, with one quick blast of c4.8xlarge action to run through all the possible auto scaling parameters. Before analyzing the results further, I’ll share a quick sample of what an arbitrary day’s worth of simulation time looks like under this traffic pattern:
Yeah! Now that’s what I call auto scaling! Notice how violent the spikes can be, and that the model is optimized to just barely absorb them. Also notice the big, solid chunk of purple in the middle of the day that represents constant utilization of a good chunk of the fleet, which is good for machine efficiency.
After looking at that graph, this plot of savings over fixed-size fleets with the new traffic pattern shouldn’t be surprising:
First, we save a lot more for each traffic level compared to the constant-traffic case. Second, and more importantly, the savings don’t decrease but increase with traffic. When we tried to auto scale against random traffic spikes, more traffic just meant more randomness and spikiness. When scaling against a more slowly moving base traffic rate, though, more traffic means we have a solid foundation of usage (the big purple hunk in the graph above), while still being able to shut off most of the machines at night.
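To make "savings over fixed-size fleets" concrete, here is one plausible way to compute it. It assumes the fixed fleet is sized to the auto scaled fleet's peak and that cost is proportional to machine-minutes. The function name and the toy fleet sizes are hypothetical, and the simulation's own accounting may differ.

```python
def savings_vs_fixed_fleet(machines_per_minute):
    """Fraction of machine-minutes saved relative to a fleet fixed at the peak size."""
    if not machines_per_minute or max(machines_per_minute) <= 0:
        raise ValueError("need at least one minute with a running machine")
    fixed_cost = max(machines_per_minute) * len(machines_per_minute)
    scaled_cost = sum(machines_per_minute)
    return 1 - scaled_cost / fixed_cost


# Toy day: 2 machines overnight, 10 during an 8-hour working block.
toy_fleet = [2] * (8 * 60) + [10] * (8 * 60) + [2] * (8 * 60)
toy_savings = savings_vs_fixed_fleet(toy_fleet)
```

In this toy day the fixed fleet pays for 10 machines around the clock, while the scaled fleet pays for 10 only during the busy third of the day, which works out to a saving of a bit over half.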
So how do I set up my ASG already?
My primary goal when I started this exercise was to determine whether auto scaling makes sense for the real-world boot times and traffic patterns of CircleCI Enterprise customers, and if so, which auto scaling parameters yield the best performance.
The answer to the first part of the question seems pretty clear based on the results covered in this post and the last one. Auto scaling against random, minute-to-minute fluctuations in build traffic doesn’t make sense for realistic boot times and traffic patterns, but scaling according to hour-to-hour changes in traffic definitely does.
The second part of the question, about what parameters are best for any given traffic pattern, is much more difficult. My brute-force approach was useful to determine that parameters do exist for which auto scaling saves a lot of money for daily traffic patterns, but that’s not enough to be able to prescribe a complete set of e.g. AWS Auto Scaling group (ASG) parameters for any case. That’s a complicated, many-dimensional modeling problem even in my idealized model, with more complexity in the real world.
That said, I did a bit of simple analysis of the parameters turned up by my optimization effort and found a few clear trends. I captured these in a simple formula to recommend ASG parameters (CloudWatch Alarm timings, thresholds, etc.) for a given load. The formula depends only on the overall volume of builds (build minutes per hour) and not on the length of builds, so hopefully it works around the simplification that build times are constant in my model. The formula appears to do a reasonably good job of recommending parameters even for traffic that was well outside the search space of my brute-force approach:
Note that the formula only does so-so relative to the optimum parameters found by my brute-force search in the area it covered, which is expected because I didn’t capture every possible predictor of good performance in my formula. However, the formula goes on to perform very well in higher-volume cases, which is again expected because those cases should have a bigger foundation of running builds to scale around. It would probably be silly to do much more than this rough pass at recommended parameters based on my model without gathering more real-world data from customers’ clusters.
This concludes the series on auto scaling optimization for now, but our work to improve the CircleCI Enterprise auto scaling experience is far from over. Here are some of the things we’ll do next:
- Get feedback from customers: This simulation business is cool and all, but nothing beats data from the field to test the utility of the model and guide our recommendations.
- Test more traffic patterns and parameters: Using a simple sine wave of traffic was a large simplification. It would be good to test more realistic traffic patterns as well. Getting traffic data from customers may let us compare predicted auto scaling performance to reality. The simple parameter-predicting formula developed so far could also be used to narrow the parameter space and do a higher-resolution search for a better formula.
- Explore alternative auto scaling metrics and models: Experience at CircleCI has shown that scaling according to available capacity is a good way to preemptively ensure that traffic can be handled. However, targeting a constant number based on CloudWatch Alarm thresholds likely leads to overprovisioning at night if traffic is very low. There may be other metrics we can publish or techniques we can use to mitigate this.