Skip to content

A k8s infrastructure setting for ControlArena

Notifications You must be signed in to change notification settings

metatrot/ca-k8s-infra

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Training Setup

This repository contains Infrastructure as Code (IaC) and code definitions for training ML models using Ray on Kubernetes. The setup includes distributed training capabilities and secure access to model weights and datasets through MinIO.

Prerequisites

  • Access to the Kubernetes cluster
  • kubectl configured with the correct context
  • uv package manager installed

Infrastructure

The training infrastructure consists of:

  • A Ray cluster with 1 head node and 2 worker nodes
  • Cilium Network Policies for secure communication
  • Integration with internal MinIO for model weights and datasets storage

Usage

Installation

Install dependencies:

make install

Deployment

Deploy the Ray cluster and network policies:

make deploy

To refresh the deployment (delete and redeploy):

make refresh

You should run this if the ray worker image has been updated.

Training

Start a training job:

make train

This command will:

  • Install dependencies
  • Set up port forwarding to the Ray cluster
  • Run the training script
  • Automatically clean up port forwarding on completion
  • Port Forwarding
  • Manually manage port forwarding:

Start port forwarding

make start_port_forward

This will make the Ray dashboard accessible on localhost:8265 and the Hubble dashboard accessible on localhost:8080.

Stop port forwarding

make stop_port_forward

Storage

Model weights and datasets are stored in the internal MinIO deployment, accessible at minio.minio-internal.svc.cluster.local. The training setup includes utilities for copying data between S3 and local storage:

fs = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    endpoint_override="minio.minio-internal.svc.cluster.local:9000",
    scheme="http"
)

Docker Images

All dockerfiles in the training-infra/dockerfiles repo are built externally and pushed with the name of the dockerfile. Any updates to these files in a PR will automatically cause the images to be built and pushed. They can then be accessed with e.g. localhost:5001/<dockerfile_name>:{{ .Values.environment }} by helm templates.

Developer Notes

The training script automatically handles data and model transfer between MinIO and local storage.

About

A k8s infrastructure setting for ControlArena

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 77.9%
  • Makefile 14.0%
  • Dockerfile 8.1%