If you are already motivated to dockerize something, skip to the "dockerize" section (linked above in the TOC).

Motivation

There are a few use cases for putting a tool inside a container:

  • You want to share the tool
  • You don't want changes in the environment to change the tool output
  • You want the tool to be stable and not break with software updates

Containerizing a tool can also be useful if it is problematic to install. Maybe it has a lot of dependencies, or dependencies that conflict with the environment you want to run it in. For example, maybe you have a python package that only works in python 3.8, and since the tool takes in a text file as input and produces a json file as output, you don't actually need to import it in your production environment. You could create two separate python environments, but then when you want to share your production workflow, you have to explain the two different python environments that must be created to use it.

The players

These are the tools & python packages that we will be using for this tutorial.

Tool Website Version
micromamba https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html 2.0.8
docker https://docs.docker.com/engine/ 28.0.1
openff-toolkit https://docs.openforcefield.org/projects/toolkit/en/stable/ 0.16.8

Example: sdf2smiles

For the purpose of this tutorial, I am going to assume you have a working docker installation and some familiarity with the conda package manager. Our motivating example is a little script that I wrote to convert molecules in an SDF file into SMILES strings using the excellent openff-toolkit package. The script takes in a file path as an argument. I'll break the script down into pieces to explain how it works, but first I'll reproduce it in its entirety:

import sys

from openff.toolkit import Molecule


def convert(sdf_path: str) -> None:
    """
    Converts a structure from an SDF file to SMILES notation and prints the result.

    Args:
        sdf_path (str): The path to the SDF file containing molecular structures.

    This function reads the SDF file, parses the molecules, and converts them into SMILES format.
    For multiple molecules in the SDF file, it will print the SMILES representation of each.
    For a single molecule, it prints the SMILES notation of that molecule.
    """
    print(f"Converting {sdf_path} to SMILES")

    mols = Molecule.from_file(sdf_path)
    match mols:
        case list():
            for mol in mols:
                print(f"{mol.name}: {mol.to_smiles()}")
        case Molecule():
            print(f"{mols.name}: {mols.to_smiles()}")


if __name__ == "__main__":
    convert(sys.argv[1])

The script isn't very sophisticated and doesn't have any error handling, but it gets the job done!

First, the imports:

import sys

from openff.toolkit import Molecule

I like to follow the convention where the imports are split into three parts and each part is alphabetized (the motivation for this and many other code formatting decisions is to minimize the size of a diff), for example:

# standard library
import math

# 3rd party
import numpy as np

# Your own modules
from sdf2smiles import convert

Now the function signature:

def convert(sdf_path: str) -> None:

Here we are defining a function that takes in a string as input but doesn't return anything. The type hints here (str and None) are optional (there are no run-time checks) but they help other developers and tools understand what you are trying to do. So we could re-write it as

def convert(sdf_path):

which may be more familiar. When type hints were first introduced in python they were much clunkier to work with. They have since become much easier to use, and with a python 3.10 or newer codebase, most of the warts are gone.
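
For example, a return type of "a string or a list of strings" used to require helpers from the typing module, while a python 3.10+ codebase can use built-in types and the | operator directly. A quick sketch (the function names here are just for illustration):

# Pre-3.10 style: unions and optionals come from the typing module
from typing import List, Optional, Union

def old_hint(path: Optional[str]) -> Union[str, List[str]]:
    ...

# 3.10+ style: built-in generics and the | operator
def new_hint(path: str | None) -> str | list[str]:
    ...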

The next bit is a Google style doc string:

"""
Converts a structure from an SDF file to SMILES notation and prints the result.

Args:
    sdf_path (str): The path to the SDF file containing molecular structures.

This function reads the SDF file, parses the molecules, and converts them into SMILES format.
For multiple molecules in the SDF file, it will print the SMILES representation of each.
For a single molecule, it prints the SMILES notation of that molecule.
"""

When you use a standard way of writing doc strings like this, you can use automatic tooling to turn these strings into HTML for your documentation. Even if you don't think you will end up building documentation for your project, strings like this make it really easy to understand what you were doing when you look back on your code later. Something to keep in mind when doing development work is that code is written "once" but read many times, so think about how much of a favor you are doing future you the next time you find writing doc strings laborious.
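
As a small illustration, the same docstring is what python's built-in help() shows, so anyone can read it interactively (assuming the script is saved as convert.py and is importable):

import convert

help(convert.convert)            # pretty-prints the signature and docstring
print(convert.convert.__doc__)   # the raw docstring is also available directly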

Now we open, parse, and create an openff.toolkit Molecule object (easy to do with a nice API).

mols = Molecule.from_file(sdf_path)

Now the SDF file format allows more than one molecule to be defined in the file. If there is only one molecule, Molecule.from_file returns a Molecule object. If there is more than one molecule, Molecule.from_file returns a list of Molecule objects. This means that if we want to support both cases (print the SMILES representation for every molecule in the SDF) we have to handle Molecule.from_file returning either a Molecule or a list of Molecule objects. We can use python's match statement to elegantly handle this:

match mols:
    case list():
        for mol in mols:
            print(f"{mol.name}: {mol.to_smiles()}")
    case Molecule():
        print(f"{mols.name}: {mols.to_smiles()}")

Here we "match" the "case" of mols being a list() or a Molecule(). I find this easy to read and understand. Another way of doing this would be to use a combination of if/else and isinstance. One of the reasons the match statement was introduced in python (PEP-622) is because how often this problem comes up.

The difference between the two cases is that for a list we loop over the Molecule objects and print each name and SMILES string:

for mol in mols:
    print(f"{mol.name}: {mol.to_smiles()}")

And in the case of just getting a Molecule object back, we can print it directly:

print(f"{mols.name}: {mols.to_smiles()}")

The naming of the variable here isn't great, but it works!

Next we tell python how to act if this script is directly executed:

if __name__ == "__main__":
    convert(sys.argv[1])

In python __name__ is set to "__main__" in the "top-level code environment" (details here). So here we are saying "if someone runs $ python convert.py propane.sdf, run this code block". In this case, that means "call the convert function with sys.argv[1] as the argument". sys.argv is a list where the first item, sys.argv[0], is the name of the script, and the rest of the items are the arguments passed into the script, in this case propane.sdf. The reason we have if __name__ == "__main__" instead of just writing

convert(sys.argv[1])

at the end of our script is that if someone were to import sdf2smiles, that bit of code would run. But since we have if __name__ == "__main__":, python "knows" to only run that code when the file is executed directly with the python interpreter, not when it is imported.
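
To make that concrete, here is a sketch of the two ways the file can be used (assuming it is saved as convert.py):

# Run directly: __name__ is "__main__", so the guarded block calls convert()
#   $ python convert.py propane.sdf

# Imported from other code: __name__ is "convert", the guarded block is skipped,
# and nothing happens until we call the function ourselves
from convert import convert
convert("propane.sdf")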

That concludes the explanation of the script. It isn't meant to be the best example of how someone could write a tool that does this, but an accessible example.

Dockerize

Now we are going to package our script into a docker container. In principle, this technique will work for "anything"; it doesn't have to be a python script. It could be a command line tool or an R script. This example is set up for the case where you have a tool that needs to read files on your computer, but not write any files back. The case where the tool also writes files is slightly more complicated but very tractable and will be covered in a future tutorial.

Dockerfile

In order to build a docker container, we need to use a Dockerfile to tell docker how to build the container. I'll reproduce the full Dockerfile here, and then break it down like we did with the python script above.

# Docker file we are building on top of 
FROM ghcr.io/mamba-org/micromamba:2.0.8-alpine3.20

# Some metadata for the image
LABEL org.opencontainers.image.source=https://github.com/mikemhenry/sdf2smiles
LABEL org.opencontainers.image.description="Convert SDF into SMILES"
LABEL org.opencontainers.image.licenses=MIT

# We need to copy in files needed to build the environment
COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yaml convert.py .

# We now install what we need into the base micromamba environment
# See https://micromamba-docker.readthedocs.io/en/latest/quick_start.html#quick-start
RUN micromamba install -y -n base -f environment.yaml && \
    micromamba clean --all --yes

# The environment isn't automatically activated for `docker build` (but it is for `docker run`)
ARG MAMBA_DOCKERFILE_ACTIVATE=1

# We need a place to mount files inside the container
USER  root
RUN mkdir -p /opt/app/
USER $MAMBA_USER

# Set the working directory 
WORKDIR /opt/app/

# Set entrypoint, this makes it so we just have to pass an SDF file into the container
# and not worry about the path to the script
ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "python", "/tmp/convert.py"]

To start us off, we tell Docker what to use as the "base" image using the FROM command:

# Docker file we are building on top of 
FROM ghcr.io/mamba-org/micromamba:2.0.8-alpine3.20

Docker images are built in "layers". This lets us leverage caching layers (so we don't need to build from scratch every time) and also create general purpose containers that we specialize for different use cases.

The ghcr.io part is the name of the docker registry (docker.io, i.e. Docker Hub, is the default if one isn't listed). The mamba-org/micromamba part is the organization and the image repository. The part after the : (2.0.8-alpine3.20) is called the "tag". The tag is normally the version of the image plus anything else that might be relevant (like "cuda" or "dev"). I always recommend using a tag that has some version in it, and avoiding the "latest" tag, since if you rebuild the container a week later, you will likely be pulling in a different image and it may break things.

Since we are going to use micromamba to handle creating an environment for our script, it makes sense to use a docker image that has micromamba already set up. I like to use the containers linked above as a base whenever I am making a container that uses micromamba. They also have containers that work with nvidia GPUs which is nice if you want to containerize some software that uses GPUs. A full list of tags can be found here.

Here we are going to use an image that is based on micromamba 2.0.8 and alpine 3.20. alpine is a linux distribution that focuses on being as small as possible. One way it does this is by using a different runtime environment (musl instead of glibc) than a "normal" linux distribution, which can cause headaches, but lucky for us, micromamba-docker sets things up so things mostly work (they include glibc on the container). If this is your first time ever having to think about things like "is glibc on this container?", I recommend using micromamba:2.0.8-debian11-slim, which is based on a much more "normal" linux distribution that has been slimmed down (it doesn't contain a desktop environment, for example).
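
Swapping the base image is a one line change to the Dockerfile (the tag is taken from the list linked above):

# Same micromamba version, but on a slimmed-down Debian base instead of Alpine
FROM ghcr.io/mamba-org/micromamba:2.0.8-debian11-slim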

Next we define some metadata:

# Some metadata for the image
LABEL org.opencontainers.image.source=https://github.com/mikemhenry/sdf2smiles
LABEL org.opencontainers.image.description="Convert SDF into SMILES"
LABEL org.opencontainers.image.licenses=MIT

This is optional, but will help people know where to look for where this container came from, what it does, and what license governs it.

Next we will copy files we need inside the container:

# We need to copy in files needed to build the environment
COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yaml convert.py .

Here we use the COPY command to put the environment.yaml and convert.py files into the . directory (which for this image is /tmp). We also set the user and group owner of the files to $MAMBA_USER, which helps to ensure we don't end up with permission issues (since your user doesn't exist inside the container). You can view the contents of those files here.
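
For reference, the environment.yaml is just a regular conda environment file; a minimal sketch of what it could look like (the exact contents and pins live in the linked repository) is:

# a guess at the minimal contents; see the repository for the real file
channels:
  - conda-forge
dependencies:
  - python
  - openff-toolkit=0.16.8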

Now that we have our environment definition on the container, we create it with:

# We now install what we need into the base micromamba environment
# See https://micromamba-docker.readthedocs.io/en/latest/quick_start.html#quick-start
RUN micromamba install -y -n base -f environment.yaml && \
    micromamba clean --all --yes

The RUN command means "run this using the SHELL inside the container". You might be wondering why we are running two commands within the same RUN command (recall && in bash means "run the next command if the previous was successful") and why we are breaking it into two lines with \. Each RUN command essentially builds a layer of our docker container, meaning that if we build our image, then modify the Dockerfile by only adding a new RUN command at the end of the file, we don't need to re-build the whole container and can instead just "add" a new layer, which really speeds up build time. This does have an unfortunate side effect: commands like sudo apt clean and micromamba clean --all won't actually reduce the image size if they are not run in the same layer that generates the files they are meant to clean up. The files won't be on the container, but the space they took up will be "trapped" in a previous layer.
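
In other words, splitting the install and the cleanup into two RUN commands would defeat the purpose, because the cleanup would land in a later layer than the files it removes:

# Anti-pattern: the package cache is baked into the first layer,
# so this cleanup does not actually shrink the final image
RUN micromamba install -y -n base -f environment.yaml
RUN micromamba clean --all --yes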

Next we set a build argument to activate our conda environment for RUN commands.

# The environment isn't automatically activated for `docker build` (but it is for `docker run`)
ARG MAMBA_DOCKERFILE_ACTIVATE=1

For this example we don't actually need to do this, since we don't have any more RUN commands that need to run inside our environment. I copy and paste this example around all the time and don't want to forget this step, so I include it since it doesn't hurt anything. This line would be important if you had to run some commands inside the environment, like if you wanted to run pytest as part of the build step to make sure things are working, or run an initialization script:

ARG MAMBA_DOCKERFILE_ACTIVATE=1

RUN pytest --pyargs openff-utils
RUN openfe fetch rbfe-tutorial

Without setting MAMBA_DOCKERFILE_ACTIVATE=1 we would get errors like pytest command not found, since we need to run that command inside our environment.

Next, we create a place to mount files we want the script to read:

# We need a place to mount files inside the container
USER  root
RUN mkdir -p /opt/app/
USER $MAMBA_USER

# Set the working directory 
WORKDIR /opt/app/

First we use the USER command to switch to the root user (since $MAMBA_USER doesn't have permission to create a directory in /). Then we use the RUN command to create a directory at /opt/app/ (the -p switch tells mkdir to create parent directories as needed, so it is the same as mkdir /opt; mkdir /opt/app/). The location isn't really important as long as it doesn't already exist on the container. We want a fresh location since, when we run this container, we are going to mount a folder on our computer to this location. If there was something important in this folder in our container, our local files would hide it and that would make the container sad. Don't make the container sad.

We then switch back to $MAMBA_USER with the USER command since we don't want to run any more commands as the root user. Then we use the WORKDIR command to set the working directory for the image. This will save us some typing when we want to run the container. WORKDIR also sets the working directory for subsequent RUN and COPY commands, so it can be helpful to set the WORKDIR at the start of the Dockerfile to simplify the paths you need to include in commands.

Now we are at the end of our Dockerfile and have just one more command to explain:

# Set entrypoint, this makes it so we just have to pass an SDF file into the container
# and not worry about the path to the script
ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "python", "/tmp/convert.py"]

If you want to treat your container like an executable, then setting an ENTRYPOINT is a good choice. For this example, we want our command to look something like <docker command> ethane.sdf, so ENTRYPOINT is a good fit. A container can only have one ENTRYPOINT, so when we define one, we overwrite the one defined in the base mamba-org/micromamba image. The docs here explain how to set up an ENTRYPOINT that will run inside the environment that we created. The first part of the ENTRYPOINT command ensures that python comes from our environment, and then we pass in the path to our script (which we copied into the /tmp folder near the beginning of our Dockerfile). This allows us to run our container like <docker command> posed_ligands.sdf. See this doc for more information about the different docker commands you can use inside a Dockerfile.
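
So when we later run the container with an SDF file as the argument, docker appends that argument to the ENTRYPOINT, and the effective command inside the container is roughly:

/usr/local/bin/_entrypoint.sh python /tmp/convert.py posed_ligands.sdf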

Now, before we build our container using our Dockerfile, we have one more topic to cover - the .dockerignore file. I like to use a .dockerignore file as a best practice when working with docker images. By default, everything (recursively) in the folder where you run the docker build command is added to the docker build context. This is helpful since it lets us reference these files in our Dockerfile, but it also means that files we didn't want are added (like large files not relevant to our build, or .git) which can increase the final size of the container.

Since I often have a Dockerfile in a repository that has a lot of stuff, I like to set up my .dockerignore file to ignore everything, then just add the bits I need:

# ignore everything
*

# only allow
!environment.yaml
!convert.py

Building and using the container

Now that we have a Dockerfile and .dockerignore, we are ready to build the image. We will use the docker build command with a few extra arguments:

$ docker build . -f Dockerfile -t mmh42/sdf2smiles:latest

Since I am running this in the same folder I have my Dockerfile, I can set . as the context. I then use -f Dockerfile to tell docker build what file to use (more useful when you have multiple Dockerfiles). The -t mmh42/sdf2smiles:latest bit at the end tells docker how to "tag" (or name) the image. Since my username on Docker Hub is mmh42, if I tag my image in the format <username>/<project>:<tag> it will know where to put the image when I upload it with docker push. If you are building the image for local use, something like -t my-container will work (but isn't very descriptive).
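
Once the image is built and tagged this way, publishing it is just a matter of logging in and pushing (assuming you have a Docker Hub account):

$ docker login
$ docker push mmh42/sdf2smiles:latest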

Now that the container is built, this is how we can use it with the docker run command:

$ docker run --rm -v$(pwd):/opt/app mmh42/sdf2smiles:latest posed_ligands.sdf

The --rm option tells docker to delete the container after the command finishes. Whenever you use docker run, a container is created from the image; it stores any modifications you make and is kept by default, even if you don't actually modify anything. This is nice if you spin up a container, do things, and then want it to stick around to use later (or you made important changes in it). But if you are expecting to just spin up a container, do a thing, then never touch it again, without --rm you will start filling up your disk. The -v$(pwd):/opt/app argument says "mount my current directory to /opt/app inside the container". This means that everything in pwd will show up inside /opt/app. This is important since, by default, the container can't read any files on your computer. Since our convert.py script needs to be able to read the sdf file, we need to expose it to our container. The next bit, mmh42/sdf2smiles:latest, is the name of the image we want to run. Because of the way we set up our ENTRYPOINT, we then just need to pass in the sdf file we want to read, posed_ligands.sdf (and don't need to run it like python convert.py posed_ligands.sdf as we would locally).

Conclusion

That's all folks! Some future topics are multi-stage builds to make the final container even smaller, and some tricks we can use to build the image faster.

Note:

This blog post is in sync with this commit. There may be updates in the future which will cause this post and the linked repository to diverge. I do not plan to keep them perfectly in sync but will update both to fix any major bugs. If you do notice an issue, please raise an issue ☺️