In CI we spend a lot of time talking about both speed and determinism. Continuous integration exists to help teams build and ship software faster, and with quality. We do this by integrating small improvements into our codebase often and employing dynamic testing to ensure we aren’t breaking anything as we go. One key factor in ensuring quality is a concept called determinism. Determinism, in essence, means that with the same inputs, you’ll get the same outputs. A test that runs once and fails shouldn’t pass the next time it’s run if nothing else has changed.
Over the last few months, we have been working on fixing a problem that was affecting the consistency of build speed for certain projects. In this post I’ll talk about what we found and how we fixed it. The result is that projects using the Docker executor on CircleCI can now configure a RAM disk to perform I/O-heavy operations in memory rather than on SSD.
The problem: same job, different results
An important thing to note about determinism is that without it, it’s hard to measure anything else: a measurement, any measurement (e.g. “how long does this build take to run?”), can only give you meaningful information if you can rely on it to be consistent.
I’m a Solutions Engineer here at CircleCI, so I work with prospective customers to troubleshoot issues that arise as they evaluate the platform. We recently worked with a prospective customer that was experiencing inconsistent performance stemming from cache restores.
With the customer in question, we occasionally saw cache restores taking double or even triple their usual duration (up to 90 seconds). Fixing this was especially important because the customer was in the process of migrating from a legacy Jenkins installation, and commit-to-deploy time – an area where CircleCI traditionally excels – was a key success metric for any new CI/CD setup. This variation in cache restore time made it difficult to get a handle on that crucial metric.
The impact of a variance like this is particularly pronounced in jobs running in parallel. CircleCI gives users the ability to split a large test suite between an array of different machines at once. These parallel builds, however, are only as fast as the longest-running job. Any unnecessary variance can require users to wait for a single job to finish before moving on with their workflow.
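As a rough illustration, splitting a suite across containers looks something like the sketch below. The image tag, glob pattern, and test runner are placeholders rather than a recommended setup.

```yaml
jobs:
  test:
    docker:
      - image: cimg/node:16.14   # example image
    parallelism: 4               # run this job as four identical containers
    steps:
      - checkout
      - run:
          command: |
            # The CircleCI CLI divides the matched test files across the four containers
            npx jest $(circleci tests glob "test/**/*.test.js" | circleci tests split)
```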
While variance in speed is frustrating, it was more concerning to me in my role as a Solutions Engineer because it made it impossible to give accurate reports back to the prospective customer about any speed improvement CircleCI might provide for their builds. I decided to look into this problem a bit more deeply to help this prospect, but ended up finding a solution that helps anyone running Node.js on our platform.
Digging in
I decided to explore what might be causing this. How could we speed up commit-to-deploy time? Ultimately we discovered that all of the inconsistencies we were seeing happened while restoring caches and workspaces for JavaScript projects running in Docker containers. The problem seemed to stem from the fact that CircleCI runs multiple customers’ Docker containers on the same virtual machine at the same time: if all of a machine’s tenants are making heavy use of the hard drive shared among them, the hardware can get bogged down.
All the projects we noticed these inconsistencies with were Node.js based, and all cached their node_modules folder. Node modules are an interesting case, because the cache isn’t simply one large artifact (we were seeing projects in the 150-300 MB range), but rather an enormous number of small files. One project we examined had a node_modules folder containing more than 150,000 files. Further, we didn’t see the same variability in performance for larger caches on other, non-Node.js projects.
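For context, this kind of caching looks roughly like the steps below in a CircleCI config. This is a minimal sketch, not the customer’s actual configuration; the cache key name is illustrative.

```yaml
steps:
  - checkout
  # Restore node_modules from a previous run, keyed on the lockfile
  - restore_cache:
      keys:
        - v1-deps-{{ checksum "package-lock.json" }}
  - run: npm install
  # Save the (often hundreds of thousands of) small files for the next run
  - save_cache:
      key: v1-deps-{{ checksum "package-lock.json" }}
      paths:
        - node_modules
  - run: npm test
```

Restoring that cache means unpacking every one of those small files onto disk, which is exactly the I/O pattern that turned out to be sensitive to a busy SSD.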
The first experiment
The first step was to look at the tar command used for compressing the cache archive. tar itself is fairly straightforward: it archives and compresses the specified directories. One immediate observation is that the compression isn’t multi-threaded, so increasing the CPUs available to the job would understandably have no impact. This led to the first experiment: using an alternative compressor, specifically pigz (a parallel implementation of gzip), to reduce the amount of time spent compressing cache archives. Locally, this worked well, demonstrating vastly improved performance. Testing on CircleCI as part of our test project yielded an improved best case, but still demonstrated the same frustrating variance across multiple runs.
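If you want to reproduce the comparison yourself, the swap is a small change to how the archive is produced. The step below is a sketch, not our internal caching implementation; it assumes an image where pigz can be installed with apt and sudo.

```yaml
- run:
    name: Archive node_modules with pigz instead of gzip
    command: |
      # gzip (what tar -z invokes) compresses on a single core;
      # pigz spreads the same compression across all available cores
      sudo apt-get update && sudo apt-get install -y pigz
      tar --use-compress-program=pigz -cf node_modules.tar.gz node_modules
```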
Exploring memory-based solutions
Because CPU didn’t appear to be the bottleneck, we arrived at a new hypothesis: the disk itself might be the problem. A tradeoff we often make when building software is to trade hard drive space for RAM, because memory is faster and has fewer restrictions. A bonus for our customers is that solving this problem with memory means they’d be able to use that resource not just for caching or restoring workspaces, but for any kind of operation in their Node.js CI environment.
In CircleCI’s case, our Docker executor works by hosting Docker containers on an underlying AWS EC2 VM, allocating the requested CPU and RAM to each job, but the containers all share a local SSD. One way to test the hypothesis was to leverage a ramdisk and avoid writing to this SSD at all. Conveniently, our containers already had a ramdisk mounted at /dev/shm, which made a quick trial of this approach easy. We saw immediate improvement in both raw performance and consistency. Unfortunately, /dev/shm is mounted as NOEXEC, which poses an immediate problem if it is to be used as a working directory: users often have executable scripts they need to invoke to install dependencies or run tests.
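You can see both properties of /dev/shm from inside a job. The step below is just an illustration of the check, not part of our internal tooling.

```yaml
- run:
    name: Inspect the /dev/shm mount
    command: |
      # Shows the tmpfs (RAM-backed) mount and its flags, including noexec
      mount | grep /dev/shm
      # Shows how much memory the mount has available
      df -h /dev/shm
```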
How we solved the problem
Ultimately, we were able to add a new directory to every single container that runs on our platform, and everything written to that directory lives in memory. We made a small change to the way we mount Docker containers in order to expose a separate ramdisk at /mnt/ramdisk. This ramdisk can use as much memory as is available in the resource class specified in the user’s configuration, and it does not have the NOEXEC limitation detailed above. For users with large git repositories, caches, workspaces, or otherwise I/O-heavy workloads, an easy way to get started is to use the new /mnt/ramdisk location as their working directory via the working_directory setting in their configuration.
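In config terms, that looks something like the sketch below. The image tag, resource class, and cache key are examples rather than requirements; the essential line is working_directory.

```yaml
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/node:16.14        # example image; any Docker executor image works
    working_directory: /mnt/ramdisk   # checkout, caches, and node_modules now live in RAM
    resource_class: large             # larger resource classes give the ramdisk more memory to draw on
    steps:
      - checkout
      - restore_cache:
          keys:
            - v1-deps-{{ checksum "package-lock.json" }}
      - run: npm install
      - run: npm test
```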
RAM disks now provide a convenient way of performing I/O-heavy operations in memory rather than on SSD, while still letting you leverage the convenience, determinism, low cost, and fast spin-up times of container-based CI/CD jobs.
This approach had a few other advantages as well. Because memory is so fast and flexible, it gives customers an additional reason to take advantage of the larger resource classes our platform provides. This change is also useful beyond caching – it can be used for any I/O-intensive operation in a customer’s workflow.
This change resulted in two major improvements to our product. The first, and most obvious, was a 40 percent reduction in workflow times for the jobs where we had seen consistency problems. Here’s an example of a workflow run before and after the fix.
More important, however, these times were now consistent – more deterministic – on every run. Without variance in this key performance metric, we’re now able to benchmark performance much more easily and accurately, while also alleviating the frustration customers must have felt when they ran into this problem.
This feature is live in our product now, so if your team is running JavaScript jobs with caching or workspaces, you now have access to an improvement that could help with both performance and consistency.
If you like finding and solving important problems like this, join our team!