This repository contains Infrastructure as Code (IaC) and code definitions for training ML models using Ray on Kubernetes. The setup includes distributed training capabilities and secure access to model weights and datasets through MinIO.
- Access to the Kubernetes cluster
- kubectl configured with the correct context
- uv package manager installed
The training infrastructure consists of:
- A Ray cluster with 1 head node and 2 worker nodes
- Cilium Network Policies for secure communication
- Integration with internal MinIO for model weights and datasets storage
Install dependencies:
make install
Deploy the Ray cluster and network policies:
make deploy
To refresh the deployment (delete and redeploy):
make refresh
Run this whenever the Ray worker image has been updated.
Start a training job:
make train
This command will:
- Install dependencies
- Set up port forwarding to the Ray cluster
- Run the training script
- Automatically clean up port forwarding on completion
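Under the hood, `make train` presumably drives Ray's job-submission API through the port-forwarded dashboard. The sketch below shows that pattern; the entrypoint `train.py`, the `./training` working directory, and the address default are illustrative assumptions, not taken from the repo:

```python
def training_job_spec(working_dir: str = "./training") -> dict:
    # Pure helper so the job spec can be inspected without a running cluster.
    # The entrypoint and working_dir are illustrative assumptions.
    return {
        "entrypoint": "python train.py",
        "runtime_env": {"working_dir": working_dir},
    }


def submit_training_job(address: str = "http://localhost:8265") -> str:
    # Requires `ray` installed and the dashboard port-forwarded
    # (make start_port_forward). Returns the submitted job's ID.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(**training_job_spec())
```

With the dashboard forwarded, `submit_training_job()` returns a job ID whose progress can then be watched on the Ray dashboard at localhost:8265.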
Manually manage port forwarding:

make start_port_forward

This makes the Ray dashboard accessible on localhost:8265 and the Hubble dashboard accessible on localhost:8080. To stop port forwarding:

make stop_port_forward
Model weights and datasets are stored in the internal MinIO deployment, accessible at minio.minio-internal.svc.cluster.local. The training setup includes utilities for copying data between S3 and local storage:
```python
import pyarrow.fs

fs = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    endpoint_override="minio.minio-internal.svc.cluster.local:9000",
    scheme="http",
)
```
All Dockerfiles in the training-infra/dockerfiles directory are built externally and pushed under the name of the Dockerfile. Any update to these files in a PR automatically triggers the images to be built and pushed. Helm templates can then reference them as, e.g., localhost:5001/<dockerfile_name>:{{ .Values.environment }}.
The training script automatically handles data and model transfer between MinIO and local storage.