Apptainer for HPC
1. From Docker to Apptainer
Clusters rarely allow Docker because its daemon runs as root, and giving users access to that daemon is effectively giving them root on a shared machine, which is a massive security risk. Enter Apptainer (formerly Singularity). It is the de facto standard for containerization in HPC.
1.1. The Landscape (Docker vs. Conda vs. Apptainer)
Before diving into commands, let's clarify where Apptainer fits in.
| Feature | Conda | Docker | Apptainer |
| --- | --- | --- | --- |
| Primary Scope | Python packages & binaries | Full System (OS + Libs) | Full System (OS + Libs) |
| Privilege | User level (No root) | Root required (Daemon) | User level (No root) |
| Isolation | Weak (Library path manipulation) | Strong (Network/FS isolation) | Integrated (Shares Net/FS) |
| File Format | Directory of files | Layered Images | Single File (.sif) |
- Why not just Conda? Conda is great, but it relies on the host's system libraries (like glibc). If the cluster's OS is too old, your Conda environment might fail. Apptainer brings its own OS, solving this.
- Why not Docker? We don't have sudo access on the cluster, so we can't use Docker.
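To see which glibc the host provides (the library that Conda binaries depend on), a quick check could look like the sketch below. It assumes a glibc-based Linux host; getconf GNU_LIBC_VERSION is a glibc-specific query, so the fallback prints "unknown" on other systems.

```shell
# Query the host's C library version.
# Assumption: a glibc-based Linux host; elsewhere this falls back to "unknown".
host_glibc="$(getconf GNU_LIBC_VERSION 2>/dev/null || echo unknown)"
echo "Host C library: ${host_glibc}"
```

If a package needs a newer glibc than the one reported here, a Conda environment on this host is likely to fail, while a container ships its own copy.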
1.2. How It Actually Works
To use Apptainer effectively, you must understand two concepts: OS Capabilities and Bind Mounts.
1.2.1. User Space vs. Kernel Space
Apptainer does not virtualize hardware. It shares the Host Kernel, but swaps out the User Space.
- Can I change the OS? Yes. You can run Ubuntu 22.04 on a CentOS 7 host. Apptainer replaces directories like /bin, /usr, and /etc.
- Can I change the GPU Driver? No.
  - NVIDIA Driver (Kernel Space): Must be installed by the cluster admin on the physical host. Apptainer cannot change this.
  - CUDA Toolkit (User Space): Apptainer can change this.
  - Implication: You can use any CUDA Toolkit version (e.g., 11.8, 12.1) inside the container, as long as the physical host driver is new enough to support it.
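The "driver must be new enough" rule can be checked before picking an image. The sketch below is one way to do it, not an official tool: version_ge is an illustrative helper, REQUIRED_DRIVER is a hypothetical variable, and its default value is only a placeholder; look up the real minimum for your CUDA Toolkit in NVIDIA's compatibility table.

```shell
# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Placeholder minimum driver version (assumption): replace with the value
# NVIDIA documents for the CUDA Toolkit inside your container.
required_driver="${REQUIRED_DRIVER:-525.60.13}"

if command -v nvidia-smi >/dev/null 2>&1; then
    # Ask the host driver for its version (first GPU only).
    host_driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
    if version_ge "$host_driver" "$required_driver"; then
        echo "Host driver ${host_driver} satisfies ${required_driver}"
    else
        echo "Host driver '${host_driver}' is older than ${required_driver}"
    fi
else
    echo "nvidia-smi not found; run this check on a GPU node"
fi
```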
1.2.2. Bind Mounts (The Portal)
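By default, a container sees only a small part of the host filesystem (typically your home directory, the current working directory, and /tmp, depending on how the admins configured Apptainer). "Bind Mounting" opens a window between any other path on the physical host and the container.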
- Syntax: --bind /host/path:/container/path
- Why do we need it? If your data is on the cluster's shared storage (/mnt/shared), the container cannot see it unless you bind it.
- Best Practice: Map the paths identically (e.g., --bind /mnt:/mnt). This way, your file paths in your code work exactly the same whether running locally or inside the container.
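When several directories need the same identity mapping, the bind argument can be generated rather than typed by hand. This is a minimal sketch: make_bind_spec is an illustrative helper and /scratch is a hypothetical second path. It relies on --bind accepting a comma-separated list of mappings.

```shell
# make_bind_spec PATH...: build a comma-separated bind spec that maps each
# host path to the same path inside the container.
make_bind_spec() {
    spec=""
    for path in "$@"; do
        spec="${spec:+${spec},}${path}:${path}"
    done
    printf '%s\n' "$spec"
}

# /scratch is a hypothetical extra storage path used for illustration.
make_bind_spec /mnt /scratch   # prints /mnt:/mnt,/scratch:/scratch
```

The result can be passed directly, for example: apptainer exec --bind "$(make_bind_spec /mnt /scratch)" ...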
1.3. The Workflow
We will use a "Sandbox First, SIF Later" workflow. This is the most flexible way to work on clusters.
1.3.1. Download & Convert (The Sandbox)
We don't want a read-only file yet. We want a writable folder to install packages. We use the --sandbox flag.

    # Syntax: apptainer build --sandbox [TARGET_DIR] [SOURCE]
    apptainer build --sandbox --fakeroot /mnt/shared/username/envs/my_project docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

- Result: You now have a folder named my_project containing the entire OS file structure.
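Because the sandbox is an ordinary directory, you can inspect it from the host without starting a container. The sketch below reads the distribution name from the sandbox's etc/os-release file; sandbox_os_name is an illustrative helper, and it assumes the image ships a standard os-release file.

```shell
# sandbox_os_name DIR: print the OS name recorded inside a sandbox directory.
sandbox_os_name() {
    release_file="$1/etc/os-release"
    if [ ! -r "$release_file" ]; then
        echo "no readable os-release under $1" >&2
        return 1
    fi
    # os-release is a shell-compatible KEY=value file; source it in a
    # subshell so its variables do not leak into the current shell.
    ( . "$release_file" && printf '%s\n' "$PRETTY_NAME" )
}

# Usage (path taken from the build command above):
#   sandbox_os_name /mnt/shared/username/envs/my_project
```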
1.3.2. Modify & Install Dependencies
Now, we enter the container shell to install custom libraries (like transformers or specific pip packages).

    # Enter the shell in writable mode
    apptainer shell \
        --writable \
        --fakeroot \
        --nv \
        --bind /mnt:/mnt \
        /mnt/shared/username/envs/my_project

- --writable: Crucial! Allows you to modify the sandbox folder.
- --nv: Enables GPU support (passes the host driver into the container).
Inside the container:

    Apptainer> pip install transformers
    Apptainer> pip install flash-attn --no-build-isolation
    Apptainer> exit

Note: Changes are persisted in the my_project folder.
1.3.3. Run Your Code
Once dependencies are installed, you don't need the interactive shell. You can "throw" commands into the container using exec.

    # Run a specific command without entering the shell
    apptainer exec \
        --nv \
        --bind /mnt:/mnt \
        /mnt/shared/username/envs/my_project \
        python /mnt/shared/username/code/train.py

Note: We removed --writable here. It's safer to run experiments in read-only mode.
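If you run many commands this way, a small wrapper saves retyping the flags. This is only a sketch: run_in_env, APPTAINER_CMD, and CONTAINER_IMAGE are hypothetical names introduced here, not Apptainer features. APPTAINER_CMD exists so the wrapper can be dry-run with echo.

```shell
# run_in_env CMD...: run a command inside the container in read-only mode.
# APPTAINER_CMD   (hypothetical) lets you swap in "echo" for a dry run.
# CONTAINER_IMAGE (hypothetical) is the sandbox folder or .sif file to use.
run_in_env() {
    "${APPTAINER_CMD:-apptainer}" exec \
        --nv \
        --bind /mnt:/mnt \
        "${CONTAINER_IMAGE:?set CONTAINER_IMAGE to a sandbox or .sif path}" \
        "$@"
}

# Usage:
#   CONTAINER_IMAGE=/mnt/shared/username/envs/my_project
#   run_in_env python /mnt/shared/username/code/train.py
```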
1.3.4. Package (Production)
Sandboxes (folders) contain thousands of small files, which can be slow on networked storage. For long-term usage, package it into a single .sif file.

    # Convert the sandbox folder into a compressed SIF file
    apptainer build /mnt/shared/username/envs/final_image.sif /mnt/shared/username/envs/my_project

You can now run final_image.sif exactly like you ran the folder in Section 1.3.3: pass the .sif path to apptainer exec in place of the sandbox path.
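To get a feel for why the sandbox is slow on networked storage, you can count how many files it contains; the SIF replaces all of them with one. A minimal sketch, with count_files as an illustrative helper:

```shell
# count_files DIR: print the number of regular files under DIR.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Usage (sandbox path from the earlier steps):
#   count_files /mnt/shared/username/envs/my_project
```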
1.4. Running on Slurm
When using a scheduler like Slurm (srun or sbatch), you need to be careful about environment variables and networking.
1.4.1. The Command Structure
The hierarchy of execution looks like this: Slurm (Resource Allocation) -> Apptainer (Environment) -> Python (Process)

    # Example Slurm command
    srun -p gpu_partition --gres=gpu:1 --export=ALL \
        apptainer exec --nv --bind /mnt:/mnt \
        /mnt/shared/username/envs/final_image.sif \
        python train.py

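The same hierarchy works for batch jobs submitted with sbatch. The sketch below writes a job script that mirrors the srun example; the file name train.sbatch, the job name, and the log file pattern are assumptions made for illustration, and the partition name is the one used above.

```shell
# Write a batch script equivalent to the srun example.
# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu_partition
#SBATCH --gres=gpu:1
#SBATCH --export=ALL
#SBATCH --output=train_%j.log

apptainer exec --nv --bind /mnt:/mnt \
    /mnt/shared/username/envs/final_image.sif \
    python /mnt/shared/username/code/train.py
EOF

echo "Wrote train.sbatch; submit it with: sbatch train.sbatch"
```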
1.4.2. Environment Variables
Does your current terminal configuration (.bashrc or export VAR=...) pass into the container?
- Usually, yes. Apptainer tries to inherit variables.
- The Trap: Slurm might strip them before they reach the compute node.
- The Fix: Always use --export=ALL in your srun command to ensure your proxy settings and API keys travel with the job.
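One related pitfall is worth checking first: a variable that is set but not exported never leaves your shell, so neither Slurm nor Apptainer can forward it. The sketch below demonstrates the difference using a child shell as a stand-in for the job; DEMO_API_KEY is a hypothetical variable name.

```shell
# DEMO_API_KEY is a hypothetical variable used only for this demonstration.
unset DEMO_API_KEY
DEMO_API_KEY="demo-value"   # set, but not exported yet

# A child process (like srun or apptainer) does not see unexported variables.
before="$(sh -c 'printf "%s" "${DEMO_API_KEY:-<unset>}"')"

export DEMO_API_KEY         # now it becomes part of the environment

after="$(sh -c 'printf "%s" "${DEMO_API_KEY:-<unset>}"')"

echo "before export: ${before}"   # before export: <unset>
echo "after export:  ${after}"    # after export:  demo-value
```

To confirm a variable survives the whole chain, you can run env inside the container through srun and look for it in the output.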