A modern template for machine learning experimentation using wandb, hydra-zen, and submitit on a Slurm cluster with Docker/Apptainer containerization.
Note: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.
- 📦 Python environment in Docker via uv
- 📊 Logging and visualizations via Weights and Biases
- 🧩 Reproducibility and modular type-checked configs via hydra-zen
- 🖥️ Submit Slurm jobs and parameter sweeps directly from Python via submitit
- 🚀 No `.def` or `.sh` files needed for Apptainer/Slurm
- 🔑 Container Registry Authentication
- 🐳 Container Setup
- 📦 Package Management
- 🛠️ Development Notes
- 🧪 Running Experiments
- 👥 Contributions
- 🙏 Acknowledgements
- Create a new GitHub token at Settings → Developer settings → Personal access tokens with:
  - `read:packages` permission
  - `write:packages` permission
With Apptainer:

```sh
apptainer remote login --username <your GitHub username> docker://ghcr.io
```

With Docker:

```sh
docker login ghcr.io -u <your GitHub username>
```

When prompted, enter your token as the password.
Choose one of the following methods to set up your environment:
- **Install the VSCode Remote Tunnels extension**

  First, install the Remote Tunnels extension in VSCode.
- **Connect to compute resources**

  For CPU resources:

  ```sh
  srun --partition=cpu-2h --pty bash
  ```

  For GPU resources:

  ```sh
  srun --partition=gpu-2h --gpus-per-task=1 --pty bash
  ```
- **Launch the container**

  To open a tunnel that connects your local VSCode to the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif code tunnel
  ```

  💡 You can specify a version tag (e.g., `v0.0.1`) instead of `latest`. Available versions are listed at the GitHub Container Registry.

  In VSCode, press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac), type "connect to tunnel", select GitHub, and select your named node on the cluster. Your IDE is now connected to the cluster.

  To open a shell in the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif /bin/bash
  ```

  💡 This may take a few minutes on the first run as the container image is downloaded.
- **Install the VSCode Dev Containers extension**

  First, install the Dev Containers extension in VSCode.

- **Open the repository in the Dev Container**

  Click the `Reopen in Container` button in the pop-up that appears once you open the repository in VSCode. Alternatively, open the command palette in VSCode by pressing `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac) and type `Dev Containers: Reopen in Container`.
In order to access Slurm with submitit from within the container, you first need to set up passwordless SSH to the login node.

On the cluster, create a new SSH key pair in case you don't have one yet:

```sh
ssh-keygen -t ed25519 -C "[email protected]"
```

Then add your public key to `authorized_keys`:

```sh
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
```

You can verify that this works by running

```sh
ssh $USER@$HOST exit
```

which should return without any prompt.
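The same verification can be scripted, e.g. from a setup check. Below is a minimal sketch (the function names are illustrative, not part of the template); `BatchMode=yes` makes SSH fail instead of prompting, so a non-zero exit code reliably signals that passwordless login is not configured:

```python
import subprocess


def ssh_check_command(user: str, host: str) -> list[str]:
    # BatchMode=yes: fail immediately instead of prompting for a password
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "exit"]


def passwordless_ssh_ok(user: str, host: str) -> bool:
    """Return True if `ssh user@host exit` succeeds without any prompt."""
    result = subprocess.run(ssh_check_command(user, host), capture_output=True)
    return result.returncode == 0
```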
- **Update dependencies**

  This project uses uv for Python dependency management. Inside the container (!):

  ```sh
  # Add a specific package
  uv add <package-name>

  # Update all dependencies from pyproject.toml
  uv sync
  ```
- **Commit changes to the repository**

  Use tags for versioning:

  ```sh
  git add pyproject.toml uv.lock
  git commit -m "Updated dependencies"
  git tag v0.0.1
  git push && git push --tags
  ```
- **Use the updated image**

  The GitHub Actions workflow automatically builds a new image when changes are pushed.

  With Apptainer:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:v0.0.1-sif /bin/bash
  ```

  With Docker:

  ```sh
  docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash
  ```
Test your Dockerfile locally before pushing:

```sh
docker buildx build -t ml-project-template .
```

Run the container directly with:

```sh
docker run -it --rm --platform=linux/amd64 ml-project-template /bin/bash
```
Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.
Create a `.env` file in the root of the repository with:

```sh
WANDB_API_KEY=your_api_key
WANDB_ENTITY=your_entity
WANDB_PROJECT=your_project_name
```
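If you need these variables in a plain Python process outside the template's own entry points, a tiny stdlib loader is enough. This is a hedged stand-in for a library such as python-dotenv (which the template may already use); it handles only simple `KEY=value` lines:

```python
import os


def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blanks and '#' comments are skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not clobber variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip())
```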
The folder `example` contains an example project which can serve as a starting point for ML experimentation.

Configuring a function

```python
from ml_project_template.utils import logger


def main(foo: int = 42, bar: int = 3) -> None:
    """Run a main function from a config."""
    logger.info(f"Hello World! foo={foo}, bar={bar}")


if __name__ == "__main__":
    main()
```
is as easy as (1) adding a `Run` as the first argument, (2) importing the config stores, and (3) wrapping the `main` function with `run`:
```python
from ml_project_template.config import run
from ml_project_template.runs import Run
from ml_project_template.utils import logger


def main(cfg: Run, foo: int = 42, bar: int = 3) -> None:
    """Run a main function from a config."""
    logger.info(f"Hello World! cfg={cfg}, foo={foo}, bar={bar}")


if __name__ == "__main__":
    from example import stores  # noqa: F401

    run(main)
```
You can try running this example with:

```sh
python example/main.py
```

Hydra will automatically generate a `config.yaml` in the `outputs/<date>/<time>/.hydra` folder which you can use to reproduce the same run later.

Try overriding the values passed to the `main` function and see how it changes the output (config):

```sh
python example/main.py foo=123
```

Reproduce the results of a previous run/config:

```sh
python example/main.py -cp outputs/<date>/<time>/.hydra -cn config.yaml
```
Enable WandB logging:

```sh
python example/main.py cfg/wandb=base
```

Run WandB in offline mode:

```sh
python example/main.py cfg/wandb=base cfg.wandb.mode=offline
```

Run a job on the cluster:

```sh
python example/main.py cfg/job=base
```

This will automatically enable WandB logging. See `example/configs.py` to configure the job settings.
Run a parameter sweep over multiple seeds using multiple nodes:

```sh
python example/main.py cfg/job=sweep
```

This will automatically enable WandB logging. See `example/configs.py` to configure sweep parameters.
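Conceptually, a sweep just expands a grid of parameter values into one list of Hydra-style overrides per job, and each list is then submitted as its own Slurm task. A minimal sketch of that expansion (illustrative only; the template's actual sweep is configured in `example/configs.py`):

```python
from itertools import product


def sweep_overrides(**param_values: list) -> list[list[str]]:
    """Expand e.g. seed=[0, 1, 2] into one "key=value" override list per job."""
    keys = list(param_values)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*param_values.values())
    ]


# Three seeds x two learning rates -> six jobs
jobs = sweep_overrides(seed=[0, 1, 2], lr=[1e-3, 1e-4])
```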
Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.
This template is based on a previous example project.