I'm a Backend Engineer focused on Go, Kubernetes and CI/CD.
Posted on Jul 8 2020
The build speed for a Docker image largely depends on whether the instructions are cached or not. Understanding the build cache is crucial for building efficient images.
Do your Docker builds take too long? Are the Dockerfile instructions executed on each build, even though they've been executed before and didn't change? Then you should make sure the build cache is being used correctly. This is particularly important for development teams, because the time wasted on slow builds adds up across all team members.
Every Dockerfile instruction creates a new intermediate image, which is stored in the Docker cache. When parsing a Dockerfile, Docker carefully examines each instruction and checks if there is a cached intermediate image for the instruction. If there is an appropriate image in the cache, Docker can reuse that image instead of running the Dockerfile instruction again.
This simple idea prevents time-consuming tasks from being executed again and again, especially if the instructions haven't been changed at all.
Docker applies a few rules to decide whether an intermediate image from the cache can be reused. For most instructions, it is sufficient to compare the instruction with the one that produced the cached intermediate image: if both are identical and were built on top of the same parent image, the cached image is reused.
For ADD and COPY instructions, however, this is not enough. Docker has to calculate the checksums for all files being copied and then compare those checksums with the checksums in the cached image. If the checksums are equal, meaning that the file contents haven't changed, Docker can reuse the cached image. Note that Docker doesn't care about last-modified timestamps: just because a file has been touched doesn't mean that its contents differ from the cached version.
ADD and COPY are the only instructions where the cache lookup takes file contents into account. Even an instruction like RUN apt-get update -y doesn't look at files or directories; only the command string itself matters.
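The two lookup rules can be seen side by side in a small sketch. The base image tag and the file name config.yaml are assumptions made purely for illustration; they do not come from a real project.

```dockerfile
# Illustrative only: image tag and file name are hypothetical.
FROM debian:bookworm-slim

# Matched by the command string alone. As long as this line stays the same,
# the cached result is reused, even if newer package lists exist upstream.
RUN apt-get update -y

# Matched by the checksum of the copied file's contents. Touching config.yaml
# without changing its contents still results in a cache hit.
COPY config.yaml /etc/app/config.yaml
```

Editing the RUN line in any way, or changing a single byte in config.yaml, would cause a cache miss for the respective instruction.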
If those conditions aren't fulfilled, the cached intermediate image cannot be reused and the cache is invalidated. This means that the current instruction and all subsequent instructions have to be executed again.
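To illustrate how an invalidation cascades, here is a minimal sketch. It assumes that the second RUN line has just been edited to additionally install jq; the package selection and the marker file are hypothetical.

```dockerfile
# Illustrative only: packages and the marker file are hypothetical.
FROM alpine:3.19

# Step 1: unchanged, so it is served from the cache.
RUN apk add --no-cache ca-certificates

# Step 2: suppose this line previously installed only curl. The command
# string no longer matches the cached one, so the cache is invalidated here.
RUN apk add --no-cache curl jq

# Step 3: unchanged, but it follows the invalidated step and runs again.
RUN echo "tools installed" > /tools-ready
```

Only step 1 benefits from the cache in this build. Once the build has completed, all three steps are cached again until the next change.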
When creating a Dockerfile, you should delay this situation as long as possible, so that as many instructions as possible are served from the cache before an invalidation kicks in.
Ordering Dockerfile instructions appropriately is the most effective way to achieve this. You should place the less frequently changed instructions at the beginning and the more frequently changed instructions towards the end of the Dockerfile. For example, it is efficient to first install packages and then copy the source code, because the source code normally changes more often than the required packages.
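As a rough sketch of this ordering, consider a small Python service. The base image tag, the curl package, the app/ directory and main.py are assumptions chosen for illustration only.

```dockerfile
# Illustrative only: image tag, package and paths are hypothetical.
FROM python:3.12-slim

# Rarely changes: system packages are installed first, so this expensive
# step stays cached across most builds.
RUN apt-get update -y \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Changes often: the source code is copied last, so editing it only
# invalidates the instructions from this point on.
COPY app/ /opt/app/

CMD ["python", "/opt/app/main.py"]
```

If the two steps were swapped, every source change would force the package installation to run again.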
The abovementioned behavior of COPY is simple in principle, but can have extensive impact. Let's assume that we want to containerize a Node.js project whose dependencies are listed in package.json. The following Dockerfile would be highly inefficient:
    FROM node:lts

    # Run all following commands inside /code.
    WORKDIR /code

    # Copy the entire project into /code.
    COPY . /code

    # Download all dependencies listed
    # in package.json.
    RUN npm ci

    # Start the application.
    CMD ["npm", "start"]
What's the problem here? package.json is copied within the same instruction as the actual source code. After that, all npm packages are downloaded. In case a file in the source code changes, the checksum of the copied files changes as well, and therefore, Docker invalidates the build cache. Any subsequent instructions have to be executed again and all npm packages will be re-downloaded.
Thus, it is important to identify cacheable units and to split them. Dependencies should only be downloaded when package.json or its lockfile has changed.
    FROM node:lts

    # Run all following commands inside /code.
    WORKDIR /code

    # First of all, only copy package.json
    # along with its lockfile into /code.
    COPY package.json package-lock.json /code/

    # Download all dependencies listed
    # in package.json.
    RUN npm ci

    # If the dependencies haven't been
    # changed, the cache is used at least
    # up to now. Copy the source files.
    COPY src /code/src

    # Start the application.
    CMD ["npm", "start"]
This Dockerfile looks much better. If a source file changes now, the npm packages won't be downloaded again. Instead, Docker recognizes that package.json and its lockfile remain unchanged and uses the cached intermediate image. Only the source code is copied again due to the change.
We'll take a look at other crucial best practices soon.