When running a cloud service, it’s never good for your customers to be the first to notice an issue. Yet over the course of a few months, that is exactly what happened to ours: we accumulated a growing series of reports of unpredictable start-up times for Docker jobs. At first the reports were rare, but their frequency increased, and jobs with high parallelism were disproportionately represented.
While we had suspicions about which parts of spin-up were slow, we wanted to be sure. To begin, we looked at the specific jobs where the issue was being reported. To help with this, we improved the logs in the spin-up step so they showed how long each phase was taking. The logs showed that in the problem cases, the download, extract, and create phases were taking the most time.
Here’s why. When a Docker job starts, we look at the list of Docker images a user has specified in their config, and pull them all to make sure they are in the image cache on the host VM. Pulling an image involves several stages, applied to each of the image’s layers:
- During download, we fetch the compressed layer from the Docker repository
- During extraction the layer is decompressed
- We then create a container for each image in the config
- The job steps are then run in the primary container
This process is repeated n times, where n is the parallelism of the job. Each of these instances of the job is referred to as a “task”. Using these definitions, it becomes clear why a job with high parallelism is likely to be affected more. The job simply rolls the “slow down dice” more times.
Adding to the problem was that we could not quickly and easily identify the issue ourselves. We were determined to address our customers’ reports, but we also wanted to make it more likely that we would spot such issues ourselves in the future. Both goals shaped the solution we built.
Initially, we needed to answer the following questions for the download, extract and create phases:
- What is the typical performance like?
- What does “slow” mean?
- How frequently are things “slow”?
We started investigating how to produce metrics for each of the three phases: download, extract, and create. Tracking the performance of layer download and extract proved to be a bit tricky.
- Docker’s pull API does not directly provide detailed aggregate information about how long layers take to download or extract.
- The pull API provides only a JSON message stream designed to produce the progress bars. You are probably familiar with the message stream if you have ever run docker pull on the command-line.
After much contemplation of the message stream, we discovered that by aggregating the time and data transferred across each of the layers and download / extract phases, we could produce fairly accurate performance metrics.
Fortunately measuring the performance of the container creation was as simple as adding a timer metric!
In addition to feeding this data into our metrics processing system, we also included it in the logs for the spin-up step:
```
Starting container cimg/go:1.16
  image cache not found on this host, downloading cimg/go:1.16
1.16: Pulling from cimg/go
...
Digest: sha256:6621f92d57703a89f82d20660e8947aa00eb07baba90be16e64666a3c726077c
Status: Downloaded newer image for cimg/go:1.16
  pull stats: download 124.3MiB in 1.283s (96.84MiB/s), extract 124.3MiB in 2.806s (44.28MiB/s)
  time to create container: 3.201s
  using image cimg/go@sha256:6621f92d57703a89f82d20660e8947aa00eb07baba90be16e64666a3c726077c

Starting container circleci/postgres:12.4-ram
  image cache not found on this host, downloading circleci/postgres:12.4-ram
12.4-ram: Pulling from circleci/postgres
...
Digest: sha256:368f896494f6cc56cbd1993b1f789548e78a1348e88e94a79aeca7c6ca8f8ac3
Status: Downloaded newer image for circleci/postgres:12.4-ram
  pull stats: download 108.1MiB in 1.099s (98.33MiB/s), extract 108MiB in 3.853s (28.03MiB/s)
  time to create container: 2.973s
  using image circleci/postgres@sha256:368f896494f6cc56cbd1993b1f789548e78a1348e88e94a79aeca7c6ca8f8ac3

Time to upload agent and config: 1.256190772s
Time to start containers: 1.347100754s
```
Deciding what “a slow pull” is
CircleCI does not control download speed; it depends on the performance of externally hosted Docker repositories. Where we can, we make sure we have fast connectivity to popular repositories, including both Docker Hub and Amazon’s ECR. Google appears to have done a good job with GCR’s speed, even though our connectivity to it goes over the public internet. Based on the data coming out of our metrics, we settled on 20 MiB/s as our threshold: downloads below that rate count as slow.
Extract is much more under our control, being entirely dependent on the resources available to the VM running Docker. We found 10 MiB/s to be a good indicator that extract performance had degraded enough to cause noticeable user impact. Due to rounding issues for small values, we found we had to exclude pulls that took less than 5 seconds to complete.
For create, we decided (somewhat arbitrarily) that containers should take no longer than 10 seconds to create. I know I wouldn’t want to wait longer than that!
So how bad was the problem?
Sadly it was much worse than we expected. This is what our dashboard showed for November 2020:
- 3.5% of downloads were slow.
- 6.7% of extracts were slow.
- 12% of creates were slow - yikes!
These figures (especially for create) were alarming! However, the team had some ideas about what the cause could be, and how the figures could be improved.
What we did to improve our findings
The first, and easiest, thing we did was simply upgrade the Docker infrastructure we were running on. We had spotted various patches in the AUFS file-system driver that looked related to container creation performance. The results:
- 3.5% of downloads were slow.
- 7.8% of extracts were slow.
- 0.77% of creates were slow - much better!
This provided a significant immediate improvement to container creation times, though at a small cost to extract times. This was a promising turn in the right direction, but not enough on its own.
What about layer download and extract performance?
Now that we were happy with the container create performance, we turned our attention to the other two metrics: layer download and extract. Getting to a solution wasn’t as simple here. We spent a fair amount of time instrumenting the Docker daemon to determine which parts of the Docker layer subsystems were going slowly. We were hoping to identify an issue like lock contention. This search produced inconclusive results, perhaps because of compatibility issues with running Docker containers on personal Linux / Mac machines. The fact that AUFS was going to be deprecated might also have been a contributing factor.
We had also been investigating a move from the AUFS filesystem driver we had been running to the overlay2 driver. We decided that trialing a small proportion of our Docker fleet on overlay2 would allow us to compare its performance against AUFS, and work out any issues customers might face. Also of note: as a side effect of switching to overlay2, we saw significantly improved disk performance.
The comparison metrics looked very promising. While there was a slight regression on the creation metric, the extract performance improved significantly, and the download improved slightly.
Slow extract improvements

Slow create improvements
The next part of this process was to slowly start rolling out this change to all of our users. We expected to hit small compatibility issues along the way, and worked hard with our customers to help them get past any problems they had. While the statistics were looking much better, we felt there was still room for improvement, especially because as the rollout progressed, the slow create statistic crept back up to 1.7%.
Improving the CircleCI Docker garbage collector
We have a component we call docker-gc that is responsible for maintaining the health of our Docker instances. Its two jobs are making sure the instances don’t run out of storage space, and deleting any now-unused Docker objects: containers, images, volumes, and networks.
The implementation at the time was a simple Bash script, scheduled by a cron job to run regularly. This had several issues:
- It was not observable, producing only simple logs.
- If a GC cycle took a long time, the cron scheduler would try to run two cycles at the same time.
- It was not easy to test, because Bash itself is not particularly testable.
- Roll-out of the changes took a long time, because they were included in the VM image for the instances. That required instance replacement to update.
Go is an excellent language for writing low-resource-footprint agents like this. We decided to port the script to Go and deploy it as a Docker container itself. As well as immediately addressing all of the issues listed above, this gave us insight into both the negative and positive effects docker-gc was having on our Docker instances.
What we saw was pretty clear. There was a significant impact on the spin-up phases (download, extract, and create) when docker-gc was running its garbage collection cycle. Because our fleet is quite large, this wasn’t having a large percentage impact on the overall figures. For those tasks unlucky enough to start running when docker-gc collected garbage, though, it had a negative impact.
We spent about a month tweaking things, powered by the new observability:
- The only objects Docker / the CircleCI build-agent ever leaked were networks.
- Pruning volumes and containers was taking a long time inside Docker, even though the prune results showed that nothing was ever deleted. Based on these new insights, we changed docker-gc to prune only networks.
- Deleting a large quantity of images rapidly was strongly correlated with spin-up slowdown. This appeared to be related to lock contention inside the Docker daemon, but a fix there proved elusive.
We wanted to run the garbage collection cycle more frequently, but we were concerned this would make the problem “less severe, but more frequent”. After some thinking on this, we decided to suspend garbage collection when any task on the instance was in the “spin-up” state.
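The suspension logic can be as simple as a shared counter of tasks currently in spin-up, which the GC loop consults before each cycle. This is a hedged sketch of the idea, not our exact implementation; the function names are invented for illustration:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// spinUpTasks counts tasks currently in the "spin-up" state on this instance.
// Task lifecycle code increments it when spin-up begins and decrements it
// when spin-up completes.
var spinUpTasks int64

func enterSpinUp() { atomic.AddInt64(&spinUpTasks, 1) }
func leaveSpinUp() { atomic.AddInt64(&spinUpTasks, -1) }

// shouldCollect reports whether a garbage collection cycle may run right now:
// GC stays suspended while any task on the instance is spinning up.
func shouldCollect() bool {
	return atomic.LoadInt64(&spinUpTasks) == 0
}

func main() {
	fmt.Println(shouldCollect()) // true: nothing is spinning up
	enterSpinUp()
	fmt.Println(shouldCollect()) // false: a task is in spin-up, GC pauses
	leaveSpinUp()
	fmt.Println(shouldCollect()) // true: safe to collect again
}
```

Because spin-up phases are short, skipping a cycle only delays collection slightly, while guaranteeing GC never competes with a task that is downloading, extracting, or creating containers.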
The results of the project - success!
Following all these optimizations, we were able to achieve the following improvements.
- Slow downloads: 3.6% to 3.2%, an improvement of 0.4 percentage points
- Slow extracts: 6.7% to 0.84%, an improvement of 5.8 percentage points
- Slow creates: 12% to 1.7%, an improvement of 10.3 percentage points
So, progress made. Still, the team feels there are likely further improvements that can be made. We will be sure to share them with you.