Apptainer for HPC

1. From Docker to Apptainer

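Clusters rarely allow Docker because its daemon runs as root, and any user who can talk to the daemon effectively has root on the host. That is an unacceptable security risk in a shared environment. Enter Apptainer (formerly Singularity), which runs containers as your own unprivileged user. It is the de facto standard for containerization in HPC.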

1.1. The Landscape (Docker vs. Conda vs. Apptainer)

Before diving into commands, let's clarify where Apptainer fits in.

| Feature       | Conda                            | Docker                       | Apptainer                   |
|---------------|----------------------------------|------------------------------|-----------------------------|
| Primary Scope | Python packages & binaries       | Full System (OS + Libs)      | Full System (OS + Libs)     |
| Privilege     | User level (No root)             | Root required (Daemon)       | User level (No root)        |
| Isolation     | Weak (Library path manipulation) | Strong (Network/FS isolation)| Integrated (Shares Net/FS)  |
| File Format   | Directory of files               | Layered Images               | Single File (.sif)          |

  • Why not just Conda? Conda is great, but it relies on the host's system libraries (like glibc). If the cluster's OS is too old, your Conda environment might fail. Apptainer brings its own OS, solving this.
  • Why not Docker? We don't have sudo access on the cluster, so we can't use Docker.

1.2. How It Actually Works

To use Apptainer effectively, you must understand two concepts: the split between user space and kernel space, and bind mounts.

1.2.1. User Space vs. Kernel Space

Apptainer does not virtualize hardware. It shares the Host Kernel, but swaps out the User Space.

  • Can I change the OS? Yes, the user-space part of it. You can run Ubuntu 22.04 on a CentOS 7 host, because Apptainer replaces directories like /bin, /usr, and /etc while the host kernel stays in place. Software that needs kernel features newer than the host provides will still fail.
  • Can I change the GPU Driver? No.
      - NVIDIA Driver (Kernel Space): Must be installed by the cluster admin on the physical host. Apptainer cannot change this.
      - CUDA Toolkit (User Space): Lives inside the container, so Apptainer can change this.
      - Implication: You can use any CUDA Toolkit version (e.g., 11.8, 12.1) inside the container, as long as the physical host driver is new enough to support it.
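
The driver constraint above can be checked before building anything. The following is a minimal sketch, not a definitive test: version_ge is a hypothetical helper that relies on GNU sort -V, and the minimum version 525.60.13 is an assumption taken from NVIDIA's published compatibility table for CUDA 12.x, which you should verify for the toolkit you plan to use.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Uses GNU sort's -V (version sort): if $2 sorts first, then $2 <= $1.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum host driver for CUDA 12.x containers (verify with NVIDIA's table).
MIN_DRIVER="525.60.13"

# Only query the GPU if nvidia-smi exists on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    host_driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$host_driver" "$MIN_DRIVER"; then
        echo "Host driver $host_driver should support CUDA 12.x containers"
    else
        echo "Host driver $host_driver is older than $MIN_DRIVER"
    fi
fi
```

Run this on a compute node rather than the login node, since login nodes often have no GPU or driver installed.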

1.2.2. Bind Mounts (The Portal)

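By default, the container sees only a handful of host paths (typically your home directory, the current working directory, and /tmp; the exact set depends on the site configuration). "Bind Mounting" opens a window between any other directory on the physical host and the container.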

  • Syntax: --bind /host/path:/container/path
  • Why do we need it? If your data is on the cluster's shared storage (/mnt/shared), the container cannot see it unless you bind it.
  • Best Practice: Map the paths identically (e.g., --bind /mnt:/mnt). This way, your file paths in your code work exactly the same whether running locally or inside the container.
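
As a sketch of the identical-path practice, the helper below builds a comma-separated bind spec in which every host directory maps to the same path inside the container. identity_binds is a hypothetical helper and /scratch is a placeholder directory. APPTAINER_BINDPATH is the environment variable Apptainer reads as an alternative to passing --bind on every command.

```shell
# Build a comma-separated identity bind spec (/path:/path) from host
# directories, skipping any that do not exist on this machine.
identity_binds() {
    spec=""
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        spec="${spec:+$spec,}${dir}:${dir}"
    done
    printf '%s\n' "$spec"
}

# /scratch is a placeholder; list the storage roots your cluster actually uses.
APPTAINER_BINDPATH="$(identity_binds /mnt /scratch)"
export APPTAINER_BINDPATH
echo "Bind spec: ${APPTAINER_BINDPATH:-<none>}"
```

This simple version assumes the directory names contain no commas or colons, since both characters are separators in the bind syntax.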

1.3. The Workflow

We will use a "Sandbox First, SIF Later" workflow. This is the most flexible way to work on clusters.

1.3.1. Download & Convert (The Sandbox)

We don't want a read-only file yet. We want a writable folder to install packages. We use the --sandbox flag.

    # Syntax: apptainer build --sandbox [OPTIONS] [TARGET_DIR] [SOURCE]
    apptainer build --sandbox --fakeroot /mnt/shared/username/envs/my_project docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

  • Result: You now have a folder named my_project containing the entire OS file structure.

1.3.2. Modify & Install Dependencies

Now, we enter the container shell to install custom libraries (like transformers or specific pip packages).

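    # Enter the shell in writable mode
    apptainer shell \
        --writable \
        --fakeroot \
        --nv \
        --bind /mnt:/mnt \
        /mnt/shared/username/envs/my_project

  • --writable: Crucial! Allows you to modify the sandbox folder.
  • --fakeroot: Makes you appear as root inside the container so that installers can write to system paths. The cluster admins must have enabled this feature.
  • --nv: Enables GPU support by making the host's NVIDIA driver libraries and GPU devices available inside the container.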

Inside the container:

    Apptainer> pip install transformers
    Apptainer> pip install flash-attn --no-build-isolation
    Apptainer> exit

Note: Changes are persisted in the my_project folder.
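
Before moving on, it can be worth confirming that the sandbox actually sees the GPU. The sketch below assumes the sandbox path used in this tutorial and an image that ships PyTorch. verify_sandbox is a hypothetical helper that returns non-zero when Apptainer or the sandbox directory is missing.

```shell
SANDBOX="/mnt/shared/username/envs/my_project"

# Print the PyTorch version and whether CUDA is usable inside the container.
verify_sandbox() {
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer not found on PATH" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "sandbox directory not found: $1" >&2
        return 1
    fi
    apptainer exec --nv "$1" \
        python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}

verify_sandbox "$SANDBOX" || echo "verification skipped or failed"
```

If the second printed value is False on a GPU node, the usual suspects are a missing --nv flag or a host driver that is too old for the container's CUDA Toolkit.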

1.3.3. Run Your Code

Once dependencies are installed, you don't need the interactive shell. You can "throw" commands into the container using exec.

    # Run a specific command without entering the shell
    apptainer exec \
        --nv \
        --bind /mnt:/mnt \
        /mnt/shared/username/envs/my_project \
        python /mnt/shared/username/code/train.py

Note: We removed --writable here. It's safer to run experiments in read-only mode.

1.3.4. Package (Production)

Sandboxes (folders) contain thousands of small files, which can be slow on networked storage. For long-term usage, package it into a single .sif file.

    # Convert the Sandbox folder into a compressed SIF file
    apptainer build /mnt/shared/username/envs/final_image.sif /mnt/shared/username/envs/my_project

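You can now run final_image.sif exactly like you ran the sandbox folder in section 1.3.3: pass the path of the .sif file in place of the directory. A SIF image is read-only, so any further installs go through the sandbox and a rebuild.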
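
To see the small-files problem for yourself, you can count what the sandbox contains and compare it with the single packaged file. This is a rough sketch: count_files is a hypothetical helper, and the paths are the example paths from this tutorial.

```shell
# Count regular files under a directory (hypothetical helper for illustration).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

SANDBOX="/mnt/shared/username/envs/my_project"
SIF="/mnt/shared/username/envs/final_image.sif"

if [ -d "$SANDBOX" ]; then
    echo "Sandbox holds $(count_files "$SANDBOX") files"
fi
if [ -f "$SIF" ]; then
    # A SIF is one file on disk regardless of how many files it contains.
    ls -lh "$SIF"
fi
```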

1.4. Running on Slurm

When using a scheduler like Slurm (srun or sbatch), you need to be careful about environment variables and networking.

1.4.1. The Command Structure

The hierarchy of execution looks like this: Slurm (Resource Allocation) -> Apptainer (Environment) -> Python (Process)

    # Example Slurm Command
    srun -p gpu_partition --gres=gpu:1 --export=ALL \
        apptainer exec --nv --bind /mnt:/mnt \
        /mnt/shared/username/envs/final_image.sif \
        python train.py
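
For non-interactive jobs, the same hierarchy can be expressed as a batch script. The sketch below writes one to disk. The file name train_job.sbatch, the job name, and the output pattern are arbitrary choices, and the partition and paths are the example values used in this tutorial.

```shell
# Write a batch script; the quoted 'EOF' keeps the shell from expanding anything.
cat > train_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu_partition
#SBATCH --gres=gpu:1
#SBATCH --export=ALL
#SBATCH --output=%x-%j.out

# Slurm allocates resources, Apptainer supplies the environment, Python runs.
srun apptainer exec --nv --bind /mnt:/mnt \
    /mnt/shared/username/envs/final_image.sif \
    python /mnt/shared/username/code/train.py
EOF

echo "Submit with: sbatch train_job.sbatch"
```

In the output pattern, %x expands to the job name and %j to the job ID, so each run gets its own log file.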

1.4.2. Environment Variables

Does your current terminal configuration (.bashrc or export VAR=...) pass into the container?

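  • Usually, yes. Apptainer passes the calling shell's exported variables into the container unless you ask for a clean environment with --cleanenv.
  • The Trap: Slurm forwards your environment by default, but a site configuration or an --export=NONE (or --export=VAR1,VAR2) in a script can strip variables before they reach the compute node.
  • The Fix: Pass --export=ALL explicitly in your srun command to ensure your proxy settings and API keys travel with the job.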
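
A common variant of this trap is a variable that is set in your shell but never exported, so neither srun nor Apptainer inherits it. The sketch below checks for that before you submit. require_exported is a hypothetical helper, and HTTPS_PROXY and WANDB_API_KEY are placeholder names for whatever your job needs.

```shell
# Succeeds only if every named variable is exported. printenv sees only the
# exported environment, which is what child processes such as srun inherit.
require_exported() {
    missing=0
    for name in "$@"; do
        if ! printenv "$name" >/dev/null; then
            echo "not exported: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Placeholder variable names; replace with the ones your job depends on.
if require_exported HTTPS_PROXY WANDB_API_KEY; then
    echo "environment looks complete; safe to submit"
else
    echo "export the variables above before calling srun"
fi
```

To set a variable only inside the container, Apptainer also honors the APPTAINERENV_ prefix: exporting APPTAINERENV_FOO=bar on the host makes FOO=bar visible in the container.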