A modern template for machine learning experimentation using wandb, hydra-zen, and submitit on a Slurm cluster with Docker/Apptainer containerization.
Note: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.
- 📦 Python environment in Docker via uv
- 📊 Logging and visualizations via Weights and Biases
- 🧩 Reproducibility and modular type-checked configs via hydra-zen
- 🖥️ Submit Slurm jobs and parameter sweeps directly from Python via submitit
- 🚀 No `.def` or `.sh` files needed for Apptainer/Slurm
- 🔑 Container Registry Authentication
- 🐳 Container Setup
- 📦 Package Management
- 🛠️ Development Notes
- 🧪 Running Experiments
- 👥 Contributions
- 🙏 Acknowledgements
- Create a new GitHub token at Settings → Developer settings → Personal access tokens with:
  - `read:packages` permission
  - `write:packages` permission
With Apptainer:

```sh
apptainer remote login --username <your GitHub username> docker://ghcr.io
```

With Docker:

```sh
docker login ghcr.io -u <your GitHub username>
```

When prompted, enter your token as the password.
Choose one of the following methods to set up your environment:
- **Install the VSCode Remote Tunnels extension**

  First, install the Remote Tunnels extension in VSCode.
- **Connect to compute resources**

  For CPU resources:

  ```sh
  srun --partition=cpu-2h --pty bash
  ```

  For GPU resources:

  ```sh
  srun --partition=gpu-2h --gpus-per-task=1 --pty bash
  ```
- **Launch the container**

  To open a tunnel that connects your local VSCode to the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif code tunnel
  ```

  💡 You can specify a version tag (e.g., `v0.0.1`) instead of `latest`. Available versions are listed at the GitHub Container Registry.

  In VSCode, press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac), type "connect to tunnel", select GitHub, and select your named node on the cluster. Your IDE is now connected to the cluster.

  To open a shell in the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif /bin/bash
  ```

  💡 This may take a few minutes on the first run as the container image is downloaded.
- **Install the VSCode Dev Containers extension**

  First, install the Dev Containers extension in VSCode.

- **Open the repository in the Dev Container**

  Click the `Reopen in Container` button in the pop-up that appears once you open the repository in VSCode. Alternatively, open the command palette in VSCode by pressing `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac) and type `Dev Containers: Reopen in Container`.
In order to access Slurm with submitit from within the container, you first need to set up passwordless SSH to the login node.

On the cluster, create a new SSH key pair in case you don't have one yet:

```sh
ssh-keygen -t ed25519 -C "[email protected]"
```

Then add your public key to `authorized_keys`:

```sh
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
```

You can verify that this works by running

```sh
ssh $USER@$HOST exit
```

which should return without any prompt.
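The same verification can be scripted, e.g. from a setup check. Below is a minimal sketch (the function names are illustrative, not part of the template); `BatchMode=yes` makes SSH fail instead of prompting, so a non-zero exit code reliably signals that passwordless login is not configured:

```python
import subprocess


def ssh_check_command(user: str, host: str) -> list[str]:
    # BatchMode=yes: fail immediately instead of prompting for a password
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "exit"]


def passwordless_ssh_ok(user: str, host: str) -> bool:
    """Return True if `ssh user@host exit` succeeds without any prompt."""
    result = subprocess.run(ssh_check_command(user, host), capture_output=True)
    return result.returncode == 0
```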
- **Update dependencies**

  This project uses uv for Python dependency management. Inside the container (!):

  ```sh
  # Add a specific package
  uv add <package-name>

  # Update all dependencies from pyproject.toml
  uv sync
  ```
- **Commit changes to the repository**

  Use tags for versioning:

  ```sh
  git add pyproject.toml uv.lock
  git commit -m "Updated dependencies"
  git tag v0.0.1
  git push && git push --tags
  ```
- **Use the updated image**

  The GitHub Actions workflow automatically builds a new image when changes are pushed.

  With Apptainer:

  ```sh
  apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:v0.0.1-sif /bin/bash
  ```

  With Docker:

  ```sh
  docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash
  ```
Test your Dockerfile locally before pushing:

```sh
docker buildx build -t ml-project-template .
```

Run the container directly with:

```sh
docker run -it --rm --platform=linux/amd64 ml-project-template /bin/bash
```
Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.
Create a `.env` file in the root of the repository with:

```sh
WANDB_API_KEY=your_api_key
WANDB_ENTITY=your_entity
WANDB_PROJECT=your_project_name
```
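If you need these variables in a plain Python process outside the template's own entry points, a tiny stdlib loader is enough. This is a hedged stand-in for a library such as python-dotenv (which the template may already use); it handles only simple `KEY=value` lines:

```python
import os


def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blanks and '#' comments are skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not clobber variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip())
```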
The folder `example` contains an example project which can serve as a starting point for ML experimentation.

Configuring a function

```python
from ml_project_template.utils import logger


def main(foo: int = 42, bar: int = 3) -> None:
    """Run a main function from a config."""
    logger.info(f"Hello World! foo={foo}, bar={bar}")


if __name__ == "__main__":
    main()
```
is as easy as (1) adding a `Run` as the first argument, (2) importing the config stores, and (3) wrapping the `main` function with `run`:
```python
from ml_project_template.config import run
from ml_project_template.runs import Run
from ml_project_template.utils import logger


def main(cfg: Run, foo: int = 42, bar: int = 3) -> None:
    """Run a main function from a config."""
    logger.info(f"Hello World! cfg={cfg}, foo={foo}, bar={bar}")


if __name__ == "__main__":
    from example import stores  # noqa: F401

    run(main)
```
You can try running this example with:

```sh
python example/main.py
```

Hydra will automatically generate a `config.yaml` in the `outputs/<date>/<time>/.hydra` folder which you can use to reproduce the same run later.

Try overriding the values passed to the `main` function and see how it changes the output (config):

```sh
python example/main.py foo=123
```

Reproduce the results of a previous run/config:

```sh
python example/main.py -cp outputs/<date>/<time>/.hydra -cn config.yaml
```
Enable WandB logging:

```sh
python example/main.py cfg/wandb=base
```

Run WandB in offline mode:

```sh
python example/main.py cfg/wandb=base cfg.wandb.mode=offline
```

Run a job on the cluster:

```sh
python example/main.py cfg/job=base
```

This will automatically enable WandB logging. See `example/configs.py` to configure the job settings.
Run a parameter sweep over multiple seeds using multiple nodes:

```sh
python example/main.py cfg/job=sweep
```

This will automatically enable WandB logging. See `example/configs.py` to configure sweep parameters.
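Conceptually, a sweep just expands a grid of parameter values into one list of Hydra-style overrides per job, and each list is then submitted as its own Slurm task. A minimal sketch of that expansion (illustrative only; the template's actual sweep is configured in `example/configs.py`):

```python
from itertools import product


def sweep_overrides(**param_values: list) -> list[list[str]]:
    """Expand e.g. seed=[0, 1, 2] into one "key=value" override list per job."""
    keys = list(param_values)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*param_values.values())
    ]


# Three seeds x two learning rates -> six jobs
jobs = sweep_overrides(seed=[0, 1, 2], lr=[1e-3, 1e-4])
```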
Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.
This template is based on a previous example project.