Do your Docker builds take too long? Are Dockerfile instructions executed on every build, even though they ran before and nothing changed? You should make sure you're leveraging the build cache correctly. This is particularly important for development teams, as performance penalties, and the time they waste, add up across team members.
The Fundamental Idea
Every Dockerfile instruction creates a new intermediate image, which is stored in the Docker cache. When parsing a Dockerfile, Docker carefully examines each instruction and checks if there is a cached intermediate image for the instruction. If there is an appropriate image in the cache, Docker can reuse that image instead of running the Dockerfile instruction again.
This simple idea prevents time-consuming tasks from being executed again and again, especially if the instructions haven't been changed at all.
Cache Criteria
Docker considers a few criteria to decide whether an intermediate image from the cache can be reused. For most instructions, it is sufficient to compare the instruction with the one that produced the intermediate image and reuse the image if they are equal.
For the `ADD` and `COPY` instructions, however, this is not enough. Docker has to calculate checksums for all files being copied and then compare those checksums with the checksums in the cached image. If the checksums are equal, meaning that the file contents haven't changed, Docker can reuse the cached image. Note that Docker doesn't care about last-modified timestamps: just because a file has been modified doesn't mean that it is different from the cached file.
`ADD` and `COPY` are the only instructions whose cache lookup takes file contents into account. Even an instruction like `RUN apt-get update -y` doesn't look at files or directories; only the command itself matters.
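The difference between contents and timestamps can be illustrated outside of Docker with plain checksums. A minimal sketch using `sha256sum` (Docker's actual cache-key calculation includes more metadata, so this is only an analogy):

```shell
# Sketch: contents, not timestamps, determine whether checksums match.
tmp=$(mktemp -d)
cd "$tmp"

echo "console.log('hi');" > app.js
sum1=$(sha256sum app.js | cut -d' ' -f1)

# Touching the file updates the last-modified timestamp only:
touch app.js
sum2=$(sha256sum app.js | cut -d' ' -f1)

# Changing the contents changes the checksum:
echo "console.log('bye');" > app.js
sum3=$(sha256sum app.js | cut -d' ' -f1)

[ "$sum1" = "$sum2" ] && echo "mtime changed: checksums equal, cache reused"
[ "$sum1" != "$sum3" ] && echo "contents changed: checksums differ, cache invalidated"
```

Just as in this sketch, a `touch` alone will not invalidate the cache for a `COPY` instruction, but a one-character edit will.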
Cache Invalidations
If these conditions aren't met, the cached intermediate image cannot be reused and the cache is invalidated. This means that the current instruction and all subsequent instructions have to be executed again.
When creating a Dockerfile, you should delay this situation as long as possible, so that as many instructions as possible are served from the cache before an invalidation kicks in.
Ordering Dockerfile instructions appropriately is the most effective way to achieve this. You should place the less frequently changed instructions at the beginning and the more frequently changed instructions towards the end of the Dockerfile. For example, it is efficient to first install packages and then copy the source code, because the source code normally changes more often than the required packages.
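This ordering principle applies to any ecosystem, not just the Node.js case discussed below. As a sketch for a hypothetical Python project (the file names `requirements.txt` and `src/main.py` are assumptions for illustration):

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Rarely changes: install dependencies first, so this
# layer stays cached across most source edits.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Changes often: copy the source code last, so edits
# here only invalidate the cache from this point on.
COPY src ./src

CMD ["python", "src/main.py"]
```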
Incremental Builds
The behavior of `ADD` and `COPY` described above is simple in principle, but can have an extensive impact. Let's assume that we want to containerize a Node.js project whose dependencies are listed in `package.json`. The following Dockerfile would be highly inefficient:
```dockerfile
FROM node:lts

# Run the following commands in /code.
WORKDIR /code

# Copy the entire project into /code.
COPY . /code

# Download all dependencies listed
# in package.json.
RUN npm ci

# Start the application.
CMD ["npm", "start"]
```
What's the problem here? `package.json` is copied within the same instruction as the actual source code. After that, all npm packages are downloaded. If a file in the source code changes, the checksum of the copied files changes as well, and therefore Docker invalidates the build cache. All subsequent instructions have to be executed again, and all npm packages are re-downloaded.
Thus, it is important to identify cacheable units and to split them. Dependencies should only be downloaded when `package.json` has changed.
```dockerfile
FROM node:lts

# Run the following commands in /code.
WORKDIR /code

# First of all, only copy package.json
# along with its lockfile into /code.
COPY package.json package-lock.json /code/

# Download all dependencies listed
# in package.json.
RUN npm ci

# If the dependencies haven't been
# changed, the cache is used at least
# up to now. Copy the source files.
COPY src /code/src

# Start the application.
CMD ["npm", "start"]
```
This Dockerfile looks much better. If a source file changes now, the npm packages won't be downloaded again. Instead, Docker recognizes that `package.json` remains unchanged and reuses the cached intermediate image. Only the source code is copied again due to the change.
We'll take a look at other crucial best practices soon.