Tips for optimizing Docker builds

Docker images are used as the primary image in the Docker executor. They are the blueprints for containers, providing the instructions for how a container is spawned. In this post I’m going to address a few often-overlooked concepts that will help with optimizing the Docker image development and build process.

How do you build a Docker image?

Let’s start with a brief description of the Docker build process. It is a process triggered by running the docker build command using the Docker CLI tool.

The docker build command builds a Docker image based on the instructions specified in a file known as a Dockerfile. The Dockerfile is a text document that contains all the ordered commands a user would call on the command line to assemble an image.

A Docker image consists of read-only layers. Each layer represents a Dockerfile instruction. The layers are stacked, and each one is a delta of the changes from the previous layer. I think of these layers as a form of cache. Updates are only made to the layers that change versus updating every layer on every change.

The example below depicts the contents of a Dockerfile:

FROM ubuntu:18.04
COPY . /app
RUN make /app
CMD python /app/app.py

Each instruction in this file represents a separate layer in a Docker image. Below is a brief explanation of each instruction:

FROM creates a layer from the ubuntu:18.04 Docker image
COPY adds files from your Docker client’s current directory
RUN builds your application with make
CMD specifies what command to run within the container

These four commands will create layers in Docker images when they are executed during the build process.

If you’re interested in learning more about images and layers, you can read about them here.

Optimizing the image build process

Now that we’ve covered a bit about the Docker build process, I’d like to share some optimization advice to help with building images efficiently.

Ephemeral containers

The image defined by your Dockerfile should generate containers that are ephemeral. In this context, ephemeral containers mean containers that can be stopped and destroyed, then rebuilt and replaced with a freshly spawned container, using the absolute minimum setup and configuration. Ephemeral containers can be considered disposable. Every instance is new and unrelated to the previous container instances. When developing Docker images, you should leverage as many ephemeral patterns as possible.

Don’t install unnecessary packages

Avoid installing unnecessary files and packages. Docker images should remain as lean as possible. This helps with portability, shorter build times, reduced complexity, and smaller file sizes. For example, installing a text editor onto a container is not required in most cases. Don’t install any application or service that is not essential.

Implement .dockerignore files

The .dockerignore file excludes the files and directories that match patterns that you declared inside of it. This helps to avoid unnecessarily sending large or sensitive files and directories to the daemon, and potentially adding them to public images.

To exclude files not relevant to the build without restructuring your source repository, use a .dockerignore file. This file supports exclusion patterns similar to .gitignore files.

Sort multi-line arguments

Whenever possible, ease later changes by sorting multi-line arguments alphanumerically. This helps to avoid duplication of packages and it make the list much easier to update. This also makes PRs a lot easier to read and review. Adding a space before a backslash ` \ ` helps as well.

Here’s an example from the Docker’s buildpack-deps image on Docker Hub:

RUN apt-get update && apt-get install -y \
  bzr \
  cvs \
  git \
  mercurial \ 
  subversion \
  && rm -rf /var/lib/apt/lists/*

Decouple applications

Applications that are dependant on other applications are considered “coupled.” In some scenarios, they are hosted on the same host or compute node. This is common in non-container deployments, but for microservices, each application should exist in its own individual container. Decoupling applications into multiple containers makes it easier to scale horizontally and to reuse containers. For instance, a decoupled web application stack might consist of three separate containers, each with its own unique image: one to manage the web application, one to manage the database, and one for an in-memory cache.

Limiting each container to one process is a good rule of thumb. Use your best judgment to keep containers as clean and modular as possible. Then, if containers depend on each other, you can use Docker container networks to ensure that these containers can communicate.

Minimize the number of layers

Only the RUN, COPY, and ADD instructions create layers. Other instructions create temporary intermediate images, and ultimately do not increase the size of the build. Where possible, only copy the artifacts you need into the final image. This allows you to include extra tools and/or to debug information in your intermediate build stages, without increasing the size of the final image.

Leverage build cache

In building an image, Docker steps through the instructions in your Dockerfile, executing each in order. At each instruction, Docker searches for an existing image in its cache to use instead of creating a new duplicate image. This is the basic rule that Docker follows:

Starting with a parent image that is already in the cache, the next instruction is compared against all child images derived from that base image to see if one of them was built using the exact same instruction. If not, the cache is invalidated.

In most cases, simply comparing the instructions in the Dockerfile with one of the child images is sufficient. However, certain instructions require more examination and explanation.

For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.

Aside from the ADD and COPY commands, cache-checking does not look at the files in the container to determine a cache match. For example, when processing a RUN apt-get -y update command, the files updated in the container are not examined to determine if a cache hit exists. In that case, the command string is used to find a match.

Once the cache is invalidated, all subsequent Dockerfile commands generate new images and the cache is not used. Leveraging your cache involves layering your images so that only the bottom layers change often. You want your RUN steps that change more frequently towards the bottom of the Dockerfile, while steps that change less often should be ordered towards the top.

Optimize Docker image builds in CI pipelines

So far, I’ve focused on optimizing Docker image builds while covering the concepts related to this process from a code and Docker CLI build perspective. It is not largely different to implement these optimization tactics into CI pipelines, and CircleCI has a specific Docker build optimization that will dramatically speed up your automated Docker build jobs.

First, all of the optimization concepts mentioned in previous sections are valid for implementing into your CI pipeline. Especially caching. If there is a change to the Dockerfile, leveraging your cache is still the most optimal way to reduce the build time.

How does this work as a part of your CI pipeline? When using the Docker executor as the runtime for build jobs, you can leverage a feature called Docker layer caching(DLC) to speed up those builds.

DLC is a great feature to use when building Docker images is a regular part of your CI process. DLC will save image layers created within your jobs. DLC caches the individual layers of any Docker images built during your jobs, and then reuses unchanged image layers on subsequent CircleCI runs, rather than rebuilding the entire image every time.

The less your Dockerfiles change from commit to commit, the faster your image-building steps will run. DLC can be used with the machine executor and the remote Docker environment (setup_remote_docker). It is important to note that DLC is only useful when creating your own Docker image with docker build, docker compose, or similar Docker commands, it does not decrease the wall clock time that all builds take to spin up the initial environment. If you’re interested in learning more about DLC, you can read about it in our documentation.

For more Docker content, see Guide to using Docker for your CI/CD pipelines.

Summary

In this post, I covered optimization techniques for building Docker images. The build recommendations provided will serve as a guide for you in developing Docker images efficiently. CI pipelines are sped up tremendously by these build recommendations.

Most folks don’t need to build their own custom images. At CircleCI, we have built a fleet of CI-optimized Docker images for use in your CI pipelines. Read about our decisions for the development of the images in Announcing our next-generation convenience images: smaller, faster, more deterministic.

Thank you for following this post and I hope you found it useful. Please feel free to reach out with feedback on Twitter @punkdata.