Lightweight distributed training for multi-dataset workflows on local hardware.
```
 _
| |__  _   _ _ __ __ ___  __
| '_ \| | | | '__/ _` \ \/ /
| | | | |_| | | | (_| |>  <
|_| |_|\__, |_| \__,_/_/\_\
       |___/
```
Hyrax enables concurrent training of models across multiple datasets on local multi-GPU setups without the complexity of Kubernetes or distributed computing frameworks. It automatically detects available hardware, intelligently schedules jobs, and monitors training progress.
Key features:
- Automatic GPU detection and allocation (CUDA, MPS, CPU)
- Intelligent job scheduling with bin-packing optimization
- Real-time training monitoring via TensorBoard
- Dataset-agnostic interface (Minari, HDF5, pickle, custom loaders)
- Zero-configuration deployment for local machines
- Offline-first design
Install from PyPI:

```bash
pip install hyrax-lib
```

Requirements:
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (optional, CPU and Apple Silicon supported)
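A quick way to confirm the requirements above are met (this helper is a convenience sketch, not part of Hyrax):

```python
import sys

def meets_requirements():
    """Check the Python 3.8+ and PyTorch 2.0+ requirements."""
    if sys.version_info < (3, 8):
        return False
    try:
        import torch
    except ImportError:
        return False  # PyTorch 2.0+ is required
    major = int(torch.__version__.split(".")[0])
    return major >= 2

print(meets_requirements())
```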
Train a behavioral cloning model on three MuJoCo datasets concurrently:
```python
from hyrax import DistributedTrainer
import minari

trainer = DistributedTrainer(
    model=BehavioralCloningModel,
    datasets=[
        "mujoco/humanoid/expert-v0",
        "mujoco/halfcheetah/expert-v0",
        "mujoco/hopper/expert-v0"
    ],
    dataset_loader=minari.load_dataset,
)

results = trainer.train(epochs=200)
```

Hyrax automatically:
- Detects available GPUs and memory
- Schedules jobs to minimize resource contention
- Distributes datasets across workers
- Monitors training progress
- Returns aggregated results
Basic usage:

```python
from hyrax import DistributedTrainer

trainer = DistributedTrainer(
    model=YourModel,
    datasets=["dataset1", "dataset2", "dataset3"]
)

results = trainer.train(epochs=100)
```

Pass a loader callable for data formats Hyrax cannot open on its own:

```python
def load_custom_data(path):
    # your loading logic
    return dataset

trainer = DistributedTrainer(
    model=YourModel,
    datasets=["path/to/data1", "path/to/data2"],
    dataset_loader=load_custom_data
)
```

Already-loaded datasets can be passed directly:

```python
datasets = [load_data(x) for x in paths]

trainer = DistributedTrainer(
    model=YourModel,
    datasets=datasets  # already loaded
)
```

Provide memory estimates for better scheduling:

```python
trainer = DistributedTrainer(
    model=YourModel,
    datasets=datasets,
    job_size_estimates=[2*1024**3, 3*1024**3, 2*1024**3]  # bytes
)
```

Hyrax automatically logs training metrics to TensorBoard:

```bash
tensorboard --logdir=runs
```

Navigate to http://localhost:6006 to view real-time training progress across all workers.
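Per-worker logging might look roughly like the sketch below. The `SummaryWriter` usage and the `runs/` directory layout are assumptions about how a monitor like Hyrax's could write metrics, not Hyrax's actual implementation; the sketch falls back to printing when TensorBoard support is unavailable.

```python
# Hypothetical per-worker metric logging in the TrainingMonitor style.
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    SummaryWriter = None  # torch/tensorboard not installed

def log_metrics(worker_id, step, loss, logdir="runs"):
    """Write one scalar for one worker, one subdirectory per worker."""
    if SummaryWriter is not None:
        writer = SummaryWriter(f"{logdir}/{worker_id}")
        writer.add_scalar("train/loss", loss, step)
        writer.close()
    else:
        print(f"[{worker_id}] step={step} loss={loss:.4f}")

log_metrics("worker-0", 0, 1.25)
```

Giving each worker its own log subdirectory is what lets TensorBoard overlay the runs as separate curves.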
Hyrax consists of four main components:
- ResourceManager: Detects GPUs, CPUs, and available memory
- LoadBalancer: Schedules jobs using bin-packing optimization
- TrainingWorker: Executes training on assigned hardware
- TrainingMonitor: Logs metrics and progress via TensorBoard
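To illustrate the LoadBalancer idea, here is a first-fit-decreasing bin-packing heuristic of the kind such a scheduler could use. The function name, job sizes, and GPU capacities are made-up examples, not Hyrax's actual scheduler code.

```python
# First-fit decreasing: place the largest jobs first, each into the
# first GPU with enough free memory.
def pack_jobs(job_sizes, gpu_capacities):
    """Map each job index to a GPU index, or raise if a job fits nowhere."""
    free = list(gpu_capacities)
    assignment = {}
    for job in sorted(range(len(job_sizes)), key=lambda j: -job_sizes[j]):
        for gpu, space in enumerate(free):
            if job_sizes[job] <= space:
                free[gpu] -= job_sizes[job]
                assignment[job] = gpu
                break
        else:
            raise RuntimeError(f"job {job} does not fit on any GPU")
    return assignment

GiB = 1024**3
print(pack_jobs([2*GiB, 3*GiB, 2*GiB], [8*GiB, 4*GiB]))
```

Sorting jobs by descending size before placement is what keeps one large job from being stranded after the small ones have fragmented the free memory.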
Supported compute backends:
- CUDA: NVIDIA GPUs
- MPS: Apple Silicon (M1/M2/M3)
- CPU: fallback for systems without GPUs
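The backend priority above can be sketched as a small selection function. This mirrors the CUDA-then-MPS-then-CPU order implied by the list, but the function itself is an illustration, not the ResourceManager's API; it also degrades to `"cpu"` when PyTorch is absent so the sketch runs anywhere.

```python
def pick_backend():
    """Choose the best available PyTorch device string."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_backend())
```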
Good for:
- Training the same model on multiple datasets simultaneously
- Local multi-GPU workstations
- Offline training environments
- Rapid prototyping and experimentation
Not suitable for:
- Multi-node distributed training (use Ray, DeepSpeed, or Kubernetes)
- Model parallelism across GPUs
- Production inference serving
See examples/ for complete working examples:
- `mujoco_example.py`: Behavioral cloning with Minari datasets
- `custom_dataset_example.py`: Using custom data loaders
- `basic_usage.py`: Minimal example
Contributions welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE for details.
If you use Hyrax in your research, please cite:
```bibtex
@software{hyrax2026,
  author = {Hothi, Baljinder},
  title  = {Hyrax: Lightweight Distributed Training for Local Hardware},
  year   = {2026},
  url    = {https://github.com/BaljinderHothi/hyrax-lib}
}
```

Named after the rock hyrax, a small mammal that's really cute.