
πŸš€ ML Project Template

A modern template for machine learning experimentation using wandb, hydra-zen, and submitit on a Slurm cluster with Docker/Apptainer containerization.

Note: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.

Python 3.12 · Docker · WandB · Hydra Zen · Submitit

✨ Key Features

  • πŸ“¦ Python environment in Docker via uv
  • πŸ“Š Logging and visualizations via Weights and Biases
  • 🧩 Reproducibility and modular type-checked configs via hydra-zen
  • πŸ–₯️ Submit Slurm jobs and parameter sweeps directly from Python via submitit
  • πŸ”„ No .def or .sh files needed for Apptainer/Slurm

πŸ“‹ Table of Contents

πŸ”‘ Container Registry Authentication

Generate Token

  1. Create a new GitHub token at Settings β†’ Developer settings β†’ Personal access tokens with:
    • read:packages permission
    • write:packages permission

Log In

With Apptainer:

apptainer remote login --username <your GitHub username> docker://ghcr.io

With Docker:

docker login ghcr.io -u <your GitHub username>

When prompted, enter your token as the password.

🐳 Container Setup

Choose one of the following methods to set up your environment:

Option 1: Apptainer (Cluster)

  1. Install VSCode Remote Tunnels Extension

    First, install the Remote Tunnels extension in VSCode.

  2. Connect to compute resources

    For CPU resources:

    srun --partition=cpu-2h --pty bash

    For GPU resources:

    srun --partition=gpu-2h --gpus-per-task=1 --pty bash
  3. Launch container

    To open a tunnel to connect your local VSCode to the container on the cluster:

    apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif code tunnel

    πŸ’‘ You can specify a version tag (e.g., v0.0.1) instead of latest. Available versions are listed at GitHub Container Registry.

    In VSCode, open the command palette with Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac), type "Connect to Tunnel", select GitHub, and pick your named node on the cluster. Your IDE is now connected to the cluster.

    To open a shell in the container on the cluster:

    apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif /bin/bash

    πŸ’‘ This may take a few minutes on the first run as the container image is downloaded.

Option 2: Docker (Local Machine)

  1. Install VSCode Dev Containers Extension

    First, install the Dev Containers extension in VSCode.

  2. Open the Repository in the Dev Container

    Click the Reopen in Container button in the pop-up that appears once you open the repository in VSCode.

    Alternatively, open the command palette in VSCode by pressing Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac), and type Dev Containers: Reopen in Container.

Using Slurm within Apptainer

To access Slurm with submitit from within the container, you first need to set up passwordless SSH to the login node.

On the cluster, create a new SSH key pair if you don't already have one


ssh-keygen -t ed25519 -C "[email protected]"

and add your public key to the authorized_keys:

cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys

You can verify that this works by running (with $HOST set to the login node's hostname)

ssh $USER@$HOST exit

which should return without prompting for a password.
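
With SSH in place, submitit can reach Slurm from inside the container. For orientation, this is roughly what job submission looks like in plain submitit (a minimal sketch of submitit's generic API; in this template the run wrapper and the cfg/job configs described below handle submission for you):

import submitit

def add(a: int, b: int) -> int:
    return a + b

# The folder collects Slurm logs and pickled results
executor = submitit.AutoExecutor(folder="logs")
executor.update_parameters(slurm_partition="cpu-2h", timeout_min=60)

job = executor.submit(add, 1, 2)
print(job.result())  # waits for the Slurm job to finish and prints 3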

πŸ“¦ Package Management

  1. Update dependencies

    This project uses uv for Python dependency management.

    Inside the container (!):

    # Add a specific package
    uv add <package-name>
    
    # Sync the environment with pyproject.toml and uv.lock
    uv sync
  2. Commit changes to the repository:

    Use tags for versioning:

    git add pyproject.toml uv.lock 
    git commit -m "Updated dependencies"
    git tag v0.0.1
    git push && git push --tags
  3. Use the updated image:

    The GitHub Actions workflow automatically builds a new image when changes are pushed.

    With Apptainer:

    apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:v0.0.1-sif /bin/bash

    With Docker:

    docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash

πŸ› οΈ Development Notes

Building Locally for Testing

Test your Dockerfile locally before pushing:

docker buildx build -t ml-project-template .

Run the container directly with:

docker run -it --rm --platform=linux/amd64 ml-project-template /bin/bash

πŸ§ͺ Running Experiments

WandB Logging

Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.

Create a .env file in the root of the repository with:

WANDB_API_KEY=your_api_key
WANDB_ENTITY=your_entity
WANDB_PROJECT=your_project_name
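
wandb reads these variables from the environment, so no explicit login or project argument is needed in code. For illustration only (a minimal sketch; in this template the run wrapper creates and manages the wandb run for you):

import wandb

# WANDB_API_KEY, WANDB_ENTITY and WANDB_PROJECT are picked up from the environment
run = wandb.init(config={"foo": 42})
run.log({"loss": 0.1})
run.finish()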

Example Project

The example folder contains an example project that can serve as a starting point for ML experimentation. Configuring a function

from ml_project_template.utils import logger

def main(foo: int = 42, bar: int = 3) -> None:
    """Run a main function."""
    logger.info(f"Hello World! bar={bar}, foo={foo}")

if __name__ == "__main__":
    main()

is as easy as (1) adding a cfg: Run parameter as the first argument, (2) importing the config stores, and (3) wrapping the main function with run:

from ml_project_template.config import run
from ml_project_template.runs import Run
from ml_project_template.utils import logger

def main(cfg: Run, foo: int = 42, bar: int = 3) -> None:
    """Run a main function from a config."""
    logger.info(f"Hello World! cfg={cfg}, bar={bar}, foo={foo}")

if __name__ == "__main__":
    from example import stores  # noqa: F401
    run(main)
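
The imported example.stores module is what registers config groups such as cfg/wandb with Hydra. As a purely illustrative sketch of what such a store module could look like, assuming hydra-zen's store API (the template's actual groups and defaults live in the example folder):

from dataclasses import dataclass

from hydra_zen import store

@dataclass
class WandbConfig:
    mode: str = "online"

# Register under a group so it can be selected on the CLI, e.g. cfg/wandb=base
store(WandbConfig, group="cfg/wandb", name="base")

# Push the accumulated entries into Hydra's global config store
store.add_to_hydra_store()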

You can try running this example with:

python example/main.py

Hydra will automatically generate a config.yaml in the outputs/<date>/<time>/.hydra folder, which you can use to reproduce the same run later.
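
To inspect such a saved config programmatically, you can load it with OmegaConf, which Hydra uses under the hood (a small sketch; fill in the actual date and time placeholders):

from omegaconf import OmegaConf

cfg = OmegaConf.load("outputs/<date>/<time>/.hydra/config.yaml")
print(OmegaConf.to_yaml(cfg))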

Try overriding the values passed to the main function and see how it changes the output (config):

python example/main.py foo=123

Reproduce the results of a previous run/config:

python example/main.py -cp outputs/<date>/<time>/.hydra -cn config.yaml

Enable WandB logging:

python example/main.py cfg/wandb=base

Run WandB in offline mode:

python example/main.py cfg/wandb=base cfg.wandb.mode=offline

Single Job

Run a job on the cluster:

python example/main.py cfg/job=base

This will automatically enable WandB logging. See example/configs.py to configure the job settings.

Distributed Sweep

Run a parameter sweep over multiple seeds using multiple nodes:

python example/main.py cfg/job=sweep

This will automatically enable WandB logging. See example/configs.py to configure sweep parameters.
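
A sweep like this boils down to a Slurm job array. In plain submitit terms, that is map_array (a rough sketch; the seed values are illustrative and the real sweep parameters live in example/configs.py):

import submitit

def train(seed: int) -> float:
    return float(seed)  # placeholder for a real training run

executor = submitit.AutoExecutor(folder="logs")
# Mirrors the srun flags used earlier; assumes submitit's slurm_gpus_per_task parameter
executor.update_parameters(slurm_partition="gpu-2h", slurm_gpus_per_task=1, timeout_min=120)

# One Slurm array task per seed
jobs = executor.map_array(train, [0, 1, 2, 3])
results = [job.result() for job in jobs]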

πŸ‘₯ Contributions

Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.

πŸ™ Acknowledgements

This template is based on a previous example project.
