Introduction
Containers have become an important tool in high performance computing because they isolate software environments while still allowing efficient access to the underlying hardware. In an HPC context, containers are used to package applications, their dependencies, and sometimes parts of the runtime environment into a portable unit that can be moved between systems with minimal changes. This chapter focuses on what is special about containers in HPC settings, how they differ from traditional container use on laptops or in the cloud, and which practical constraints you must consider when using them on clusters.
Containers versus traditional software environments
In previous chapters you have seen environment modules and software stacks as mechanisms to control which compilers, libraries, and tools are visible in your shell. Containers address the same problem of software management, but they do it by bundling the application and its dependencies into an image file. This image is executed by a container runtime that sets up namespaces, mounts file systems, and controls which resources the process can see.
In traditional environments on a cluster, the system administrators ensure that the installed software is compatible with the operating system and hardware. You then select a combination such as a specific compiler, MPI library, and math library through modules. With containers, this responsibility partially shifts to the user who defines or selects an image that already includes a specific user space, libraries, and tools. The host system still provides the kernel and hardware, but much of the user level software is controlled by the container image itself.
In HPC, containers are not a replacement for modules and software stacks. Instead, they complement them. It is very common to load some modules on the host and then launch a container, or to use container images that are designed specifically for a given cluster environment.
Why container technology in HPC is different
Most introductory tutorials on containers assume a cloud or developer workstation scenario with full control over the machine and often with root access. HPC clusters are multi user systems that emphasize security, performance, and fair resource sharing. These requirements shape how containers can be used in several ways.
First, users usually do not have root privileges on login and compute nodes. Standard container runtimes like Docker expect root or a privileged daemon. Running such daemons on shared HPC systems is a security risk and is typically forbidden. HPC oriented container runtimes such as Singularity and its successor Apptainer are designed to allow unprivileged container execution. They perform the privileged operations, if any, at image build time and not at runtime, and they drop privileges before executing user code.
Second, batch schedulers manage resources. On an HPC cluster you rarely run containers directly on the login node. Instead you submit a job script to the scheduler, request nodes, cores, and GPUs, and then invoke the container runtime from within the job allocation. The scheduler controls CPU and memory limits, as well as placement of tasks across nodes, while the container runtime sets up the software environment within those limits.
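As a concrete sketch, assuming a Slurm cluster where Apptainer is available on the compute nodes, a job script could request resources and then invoke the containerized application; the image name myapp.sif and the program run_analysis are placeholders.

    #!/bin/bash
    #SBATCH --job-name=container-test      # name shown in the queue
    #SBATCH --nodes=1                      # one node for this example
    #SBATCH --ntasks=1                     # one task
    #SBATCH --cpus-per-task=4              # cores for the containerized program
    #SBATCH --time=00:30:00                # walltime limit

    # The scheduler enforces the limits requested above; Apptainer only sets up
    # the user space environment for the process it launches inside those limits.
    apptainer exec myapp.sif ./run_analysis --threads ${SLURM_CPUS_PER_TASK}

The script is submitted with sbatch as usual; from the scheduler's point of view the container is just another user process.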
Third, high performance interconnects and file systems must remain usable from within containers. The container runtime must expose hardware such as InfiniBand devices and high performance parallel file systems without adding significant overhead. HPC containers are therefore generally "thin" in the sense that they reuse the host kernel and drivers, and rely on the host for advanced networking and I/O.
In HPC, containers are almost always user level, non root, and scheduler aware. They must preserve high performance access to hardware, interconnects, and parallel file systems.
Common container runtimes in HPC
Although many container runtimes exist, HPC environments have converged on a few that meet security and performance requirements.
Apptainer, formerly known as Singularity, is currently the most widely used container runtime in academic and research HPC centers. It runs containers as regular user processes and does not require a daemon. An Apptainer image is typically a single file with a .sif extension. This file contains the root file system and related metadata. Apptainer integrates well with MPI and batch schedulers, which makes it attractive for running parallel jobs.
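For example, pulling a public image and running a command inside it requires no daemon and no special privileges; the image and file names below are placeholders.

    # Pull an image from a public registry and store it as a single .sif file
    apptainer pull ubuntu.sif docker://ubuntu:22.04

    # Run a command inside the container as your normal, unprivileged user
    apptainer exec ubuntu.sif cat /etc/os-release

    # Start an interactive shell inside the container for exploration
    apptainer shell ubuntu.sif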
Charliecloud is another rootless container solution that relies on existing kernel features such as unprivileged user namespaces. It is often favored in environments where simplicity and transparency are more important than feature richness. It represents images as directory trees and tar archives instead of a single image file format.
Podman and other rootless, Docker compatible runtimes are also sometimes deployed in HPC environments, especially in mixed cloud and HPC infrastructures. However, when they are used directly on shared clusters, they must be configured carefully by system administrators to avoid security problems.
As an HPC user, you generally do not choose the runtime arbitrarily. Each cluster documents which runtimes are supported and how to use them within a job script. You must follow that documentation because the runtime will be integrated with the scheduler, the module system, and potentially site specific libraries.
Building versus running containers in HPC
Building container images is fundamentally different from running them. On a laptop or in the cloud you might both build and run containers on the same machine, often as root. On HPC systems, building directly on the cluster is restricted or discouraged, especially if the build process requires elevated privileges or generates many small I/O operations on shared file systems.
There are two common patterns to handle this separation. One pattern is to develop and build images on a local workstation or cloud VM where you have root privileges and fast local storage. For example, you can use Docker to build an image, test it locally, and then convert it to an Apptainer image using a command such as apptainer build. The resulting .sif file is then copied to the cluster and executed in your jobs.
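A sketch of this workflow, with hypothetical image, project, and host names, could look as follows.

    # On a workstation where you control Docker: build and test the image
    docker build -t myapp:1.0 .
    docker run --rm myapp:1.0 ./myapp --selftest     # placeholder test command

    # Convert the image from the local Docker daemon into a single .sif file
    apptainer build myapp_1.0.sif docker-daemon://myapp:1.0

    # Copy the image to the cluster for use in batch jobs
    scp myapp_1.0.sif user@cluster.example.org:/project/myproject/images/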
Another pattern is to use rootless builds that are supported directly on the cluster. With Apptainer this can sometimes be done using definitions that avoid operations requiring root. The build then happens entirely within your user namespace. This approach is attractive if you do not have access to a development machine, but it may have limitations, for example restricted access to certain base images or longer build times on shared file systems.
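Two rootless build variants that many sites allow are building directly from a registry and building from a definition file with user namespaces; whether the --fakeroot option is enabled is a site decision, so treat the following only as a sketch.

    # Build directly from a public registry; no privileges are required
    apptainer build myapp.sif docker://ubuntu:22.04

    # Build from a definition file using unprivileged user namespaces
    # (the recipe name myapp.def is a placeholder)
    apptainer build --fakeroot myapp.sif myapp.def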
Regardless of where you build, testing should happen in an environment that is as similar as possible to the production cluster. This includes the same CPU architecture, similar MPI stack, and similar file system behavior. Subtle differences in the host kernel or hardware can cause performance or compatibility surprises even if the image itself is identical.
In HPC, treat image build and image execution as separate workflows. Build and test on suitable systems, then run images on production clusters under scheduler control.
Containers and MPI in HPC
Parallel applications in HPC often use MPI to communicate between processes and nodes. Running MPI inside containers introduces additional complexity because there are two MPI layers involved. One MPI implementation may be present inside the container image, and another may be available on the host system through modules.
The simplest and often most robust pattern is to use the MPI library provided by the host, and ensure that the container environment is compatible with it. In this model, the mpirun or srun launcher runs on the host and starts one container instance per MPI rank; the container provides the application and user space libraries, but not a conflicting MPI implementation of its own. The host MPI then communicates across nodes using the high performance interconnect and uses its own drivers and configuration.
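A minimal sketch of the host MPI pattern under Slurm follows; the module name, image name, and application are placeholders, and the exact integration is documented by each site.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=32
    #SBATCH --time=01:00:00

    # The launcher and interconnect configuration come from the host MPI stack
    module load openmpi      # site specific module name

    # srun starts one container instance per MPI rank; the application inside
    # the image must be ABI compatible with the host MPI it is launched under
    srun apptainer exec myapp.sif ./mpi_app input.dat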
An alternative model is to bundle MPI inside the container image. This can work, but only if the MPI library is ABI compatible with the host drivers and interconnects. In practice, this often means using a container image that has been prepared specifically for a given cluster, or using an MPI library compiled with the right configuration for the target network.
It is common for HPC centers to provide example container recipes or officially supported images that demonstrate the recommended MPI integration pattern. Following these examples helps avoid subtle problems, such as mismatched MPI versions or loss of performance due to fallback to TCP over Ethernet when InfiniBand is not correctly exposed.
Access to GPUs and accelerators
Running GPU accelerated applications inside containers introduces additional concerns. The container must see the GPU devices and use the correct driver stack. At the same time, performance must remain close to native execution without a container.
In practice, GPU access with containers in HPC is usually achieved by mounting parts of the host driver stack into the container at runtime. For example, with Apptainer you might use runtime flags that automatically bind the host GPU libraries and device files into the container. The image itself then contains only the user space CUDA or other accelerator libraries, but not the kernel drivers.
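With Apptainer, for example, the --nv flag binds the NVIDIA driver libraries and device files from the host, and --rocm does the same for the AMD ROCm stack; the image and program names are placeholders.

    # NVIDIA GPUs: bind the host driver stack into the container at runtime
    apptainer exec --nv myapp_gpu.sif nvidia-smi
    apptainer exec --nv myapp_gpu.sif ./gpu_app input.dat

    # AMD GPUs: the corresponding flag for the ROCm driver stack
    apptainer exec --rocm myapp_gpu.sif ./gpu_app input.dat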
This split is important. The host kernel and its drivers are still in control of the hardware. The container provides the application and some user space libraries, but it cannot and should not attempt to replace the host drivers. This model preserves both security and performance, because the same kernel level driver stack is used as in non container execution.
When designing a GPU enabled container image, you must pay careful attention to library versions. The CUDA or ROCm versions inside the container must be compatible with the drivers installed on the cluster nodes. Many GPU vendors provide base container images with well defined version combinations. HPC users typically start from these base images and then add their application and dependencies.
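A sketch of a definition file that starts from such a vendor base image is shown below; the exact tag is only an example and must be chosen to match the driver versions supported on your cluster.

    Bootstrap: docker
    From: nvidia/cuda:12.2.0-runtime-ubuntu22.04

    %post
        # Install only user space dependencies of the application here;
        # the kernel driver always comes from the host at runtime
        apt-get update && apt-get install -y python3 python3-pip
        pip3 install --no-cache-dir numpy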
File systems, I/O, and data management with containers
Containers are often associated with immutable images and small local file systems. In HPC, applications frequently need to read and write large datasets on networked parallel file systems. A container that does not see these file systems would be useless for production runs.
HPC oriented runtimes are designed to integrate with existing storage. When you run a container in such an environment, important directories from the host, such as your home directory or project space, are usually bind mounted into the container by default. This means that paths like /home/username or the project directory on a Lustre or GPFS system are visible inside the container in the same way as outside. From the application perspective, input and output paths remain unchanged.
You, as the user, may also choose to bind mount additional directories into the container. This is particularly useful if you want to separate the image, which may live in one location, from large datasets stored elsewhere. Binding is specified at container runtime, not at image build time, which provides flexibility in how and where you run the same image.
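For example, a dataset on a parallel file system and a scratch area can be mapped to convenient paths inside the container at launch time; all paths below are placeholders.

    # Bind an extra data directory and a scratch area into the container
    apptainer exec \
        --bind /lustre/project/climate/data:/data \
        --bind /scratch/$USER:/scratch \
        myapp.sif ./process --input /data/run42.nc --workdir /scratch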
One important detail in HPC settings is the location of the image file itself. Placing large numbers of container images on shared file systems can stress metadata servers and degrade performance. Many centers therefore recommend storing frequently used images on local scratch storage or on a dedicated area optimized for such files. Consult the documentation of the specific cluster to see any local policies or performance recommendations.
Security considerations for containers in shared environments
Containers often create a false impression of full isolation. In an HPC context, you must understand that containers are not virtual machines. They share the kernel with the host, and in most configurations containerized processes have similar access rights to host resources as normal user processes. The container runtime adds some additional isolation, but it is not a strict security boundary.
This has two consequences. From the system side, administrators configure runtimes to prevent escalation of privileges and unsafe operations. For example, mapping root inside the container to root on the host would be too permissive. HPC container runtimes are often configured so that the user inside the container maps directly to the user on the host. Any actions within the container are constrained by your usual permissions.
From your side as a user, you must treat container images as you would any other software. Do not run arbitrary images from untrusted sources on production clusters. If you use public images, such as those from container registries, you should inspect or at least understand what they contain. Minimal images that add only the necessary dependencies are easier to audit than large general purpose ones.
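Apptainer offers basic tools for this kind of inspection; the image name is a placeholder, and not every image carries a stored recipe or a signature.

    # Show image metadata and, where available, the recipe it was built from
    apptainer inspect myapp.sif
    apptainer inspect --deffile myapp.sif

    # For signed images, check the signature before trusting the content
    apptainer verify myapp.sif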
Another security aspect is network access. Some clusters restrict outbound network connections from compute nodes to protect both the site and external systems. Your container will inherit these restrictions. For example, an application that tries to download data from the internet at runtime may fail if outbound connectivity is disabled, regardless of the container configuration.
Workflow patterns for using containers in HPC
To use containers effectively in HPC, it is useful to adopt a repeatable workflow that fits with batch scheduling and project lifecycles. A common pattern includes several stages.
In the development and prototyping stage, you define a container recipe that describes the base image, system packages, and application build steps. You test this image on a local machine or development cluster, ideally with a similar CPU and GPU architecture. Unit tests and small regression tests are executed within the container to ensure that the application behaves as expected.
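For Apptainer, such a recipe is a definition file. The following is only a minimal sketch with placeholder package names and a placeholder repository URL.

    Bootstrap: docker
    From: ubuntu:22.04

    %post
        # Build time: install system packages and compile the application
        apt-get update && apt-get install -y build-essential cmake git
        git clone https://example.org/myapp.git /opt/myapp-src   # placeholder URL
        cmake -S /opt/myapp-src -B /opt/myapp-build
        cmake --build /opt/myapp-build
        cmake --install /opt/myapp-build

    %runscript
        # Run time: executed by "apptainer run myapp.sif"
        exec myapp "$@"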
In the integration stage, you copy the image to the production cluster and adjust your job scripts. Typically, you still load some modules, such as the MPI module or a specific GPU toolkit, and then call the container runtime from within the job. You verify that input and output paths, environment variables, and resource requests interact correctly with the containerized application.
In the production stage, you freeze a particular image version for a project or publication. This version is then referenced by tag or checksum in documentation and scripts. Combined with version controlled input decks and job scripts, this practice greatly improves the reproducibility of large scale runs. If you need to update the application or dependencies, you create a new image version and record the change.
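Recording the frozen version can be as simple as storing a checksum next to the job scripts, or pulling registry images by digest rather than by mutable tag; the names and the digest below are placeholders.

    # Record a checksum of the frozen image alongside job scripts and input decks
    sha256sum myapp_1.0.sif >> MANIFEST.txt

    # Pin a registry image by digest instead of a tag that may be overwritten
    apptainer pull myapp.sif docker://registry.example.org/myapp@sha256:<digest>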
Finally, in the archiving stage, you may decide to store specific container images together with important simulation data and analysis scripts. Even years later, the image provides a record of the software stack used during the original runs. While hardware and host operating systems will change, a carefully chosen container image can often be re executed on newer systems with compatible kernels.
Limitations and performance considerations
Containers in HPC are powerful, but they are not a universal solution to every software problem. They introduce some limitations and potential overheads that must be understood.
From a performance perspective, the main cost of containers is not in CPU or GPU execution. Because the same kernel and drivers are used, raw compute performance is usually very close to native execution. The overhead is more likely to arise from file system accesses during container startup, certain bind mount operations, or from misconfigured MPI or network settings. In many well configured environments, this overhead is small compared to the run time of a large simulation, but it can still be noticeable for short jobs or workflows that launch thousands of very small tasks.
A more significant limitation is that containers alone cannot magically fix incompatibilities with the host kernel or hardware. For example, if an application relies on a very new system call or GPU feature that is not supported by the cluster's kernel or drivers, no container image can add that capability. Containers abstract the user space, not the kernel.
Another limitation is the complexity of debugging inside containers. Traditional debuggers and profilers must be integrated with the container environment. Many modern tools support this scenario, but there is sometimes additional configuration work. You may need to bind mount special directories used by the debugger, or ensure that the right performance counters are available.
Finally, images can become large and unwieldy if not managed carefully. Installing compilers, build tools, and multiple versions of libraries in the same image can lead to gigabyte scale image files, which can be slow to transfer between systems. A good practice is to separate build images, which contain compilers and tools, from runtime images that contain only the minimal set of dependencies needed to run the application.
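One way to implement this separation is a multi stage build, which recent Apptainer versions support in definition files; the sketch below uses a trivial placeholder build step in place of a real application.

    Bootstrap: docker
    From: ubuntu:22.04
    Stage: build

    %post
        # Heavyweight compilers and build tools exist only in this stage
        apt-get update && apt-get install -y build-essential
        mkdir -p /opt/myapp/bin
        # Placeholder build step: compile and install the real application here
        printf '#!/bin/sh\necho "myapp placeholder"\n' > /opt/myapp/bin/myapp
        chmod +x /opt/myapp/bin/myapp

    Bootstrap: docker
    From: ubuntu:22.04
    Stage: final

    %files from build
        /opt/myapp /opt/myapp

    %environment
        export PATH=/opt/myapp/bin:$PATH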
Summary
Containers in HPC provide a practical way to package complex software environments and move them between systems while retaining high performance and integration with schedulers, MPI, GPUs, and parallel file systems. They differ from typical cloud or desktop container use because they must operate securely in multi user environments without root, and they must preserve access to specialized hardware.
To use containers effectively in HPC, you must understand the supported runtime on your cluster, manage image building separately from production runs, integrate with host MPI and GPU drivers, and design workflows that capture image versions alongside job scripts for reproducibility. When used with these constraints in mind, containers become a powerful component of reproducible and portable HPC workflows.