Releases: ErenAri/HCP-workload-optimizer
v2.0.0 Advanced ML, RL Policy Optimization & Real-World Integration
Major release adding GPU-accelerated model training, reinforcement learning–based policy search, live HPC cluster connectors, and a full documentation site.
Performance Breakthroughs
- LightGBM backend: 50× faster model training (7.6s vs 383s), GPU acceleration on NVIDIA GPUs
- RL policy search: 25.5% lower p95 bounded slowdown (BSLD) than EASY_BACKFILL (2.85 vs 3.82)
- Ensemble predictor: Multi-backend auto-weighting via inverse pinball loss
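
As a rough illustration of inverse pinball loss weighting, the sketch below scores each backend on a validation set and gives it a weight proportional to the reciprocal of its loss. The backend names, the validation data, and the helper functions are hypothetical and are not taken from `python/hpcopt/models/ensemble.py`; the real ensemble may normalize or regularize differently.

```python
from typing import Dict, List, Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is penalized by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def inverse_loss_weights(losses: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """Weight each backend by 1 / loss, normalized so the weights sum to 1."""
    inv = {name: 1.0 / (loss + eps) for name, loss in losses.items()}
    norm = sum(inv.values())
    return {name: value / norm for name, value in inv.items()}


def blend(preds: Dict[str, Sequence[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-backend predictions, job by job."""
    n = len(next(iter(preds.values())))
    return [sum(weights[name] * preds[name][i] for name in preds) for i in range(n)]


# Hypothetical validation data: observed runtimes (seconds) and p50 predictions.
y_val = [100.0, 200.0, 300.0]
val_preds = {
    "backend_a": [110.0, 190.0, 310.0],
    "backend_b": [150.0, 150.0, 350.0],
}

losses = {name: pinball_loss(y_val, p, q=0.5) for name, p in val_preds.items()}
weights = inverse_loss_weights(losses)
blended = blend(val_preds, weights)
```

With these made-up numbers the more accurate backend ends up with the larger weight, which is the behavior the auto-weighting is meant to provide.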
Real-World Integration
- Slurm live connector — reads sacct/squeue, converts to canonical format, runs predictions
- PBS Pro live connector — parses qstat JSON output with full job state tracking
- Prediction feedback loop — JSONL persistence, interval coverage tracking, model drift detection with retraining alerts
- Prometheus metrics exporter: zero-dependency `/metrics` endpoint, FastAPI mountable, Grafana-ready
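
The prediction feedback loop can be pictured with a minimal sketch: each prediction and its observed runtime is appended as one JSON line, and interval coverage is the share of jobs whose actual runtime fell inside the predicted p10 to p90 band. The record fields, the function names, and the 0.1 tolerance are assumptions made for illustration; they do not describe the actual format or thresholds used by `feedback.py`.

```python
import io
import json
from typing import IO, Iterable


def append_record(fh: IO[str], job_id: str, p10: float, p90: float, actual: float) -> None:
    """Append one prediction/outcome pair as a single JSON line."""
    fh.write(json.dumps({"job_id": job_id, "p10": p10, "p90": p90, "actual": actual}) + "\n")


def interval_coverage(lines: Iterable[str]) -> float:
    """Fraction of jobs whose actual runtime fell inside [p10, p90]."""
    hits = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        total += 1
        if rec["p10"] <= rec["actual"] <= rec["p90"]:
            hits += 1
    return hits / total if total else float("nan")


def needs_retraining(coverage: float, nominal: float = 0.8, tolerance: float = 0.1) -> bool:
    """Flag drift when empirical coverage strays too far from the nominal level."""
    return abs(coverage - nominal) > tolerance


# An in-memory buffer stands in for the JSONL file on disk.
log = io.StringIO()
append_record(log, "1001", p10=50.0, p90=120.0, actual=90.0)    # inside the band
append_record(log, "1002", p10=30.0, p90=60.0, actual=200.0)    # above p90
append_record(log, "1003", p10=10.0, p90=40.0, actual=25.0)     # inside the band
append_record(log, "1004", p10=300.0, p90=900.0, actual=100.0)  # below p10

coverage = interval_coverage(log.getvalue().splitlines())
alert = needs_retraining(coverage)
```

A p10 to p90 band should cover roughly 80% of outcomes when the model is well calibrated, so a sustained gap between empirical and nominal coverage is one plausible trigger for a retraining alert.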
Documentation Site
- MkDocs Material theme with dark/light mode
- 8 tutorials: quickstart, installation, Rust engine, benchmarks, LightGBM, RL search, live integration, deployment
- Interactive benchmark dashboard (docs/dashboard.html)
Code Quality
- Ruff lint: 0 errors (85+ fixes applied)
- Mypy type-check: 0 errors (5 type fixes)
- 314 unit tests passing
- `ruff format` applied project-wide
New Files
| File | Purpose |
|---|---|
| python/hpcopt/models/ensemble.py | Multi-backend ensemble predictor |
| python/hpcopt/simulate/rl_env.py | Gym-like RL scheduling environment |
| python/hpcopt/integrations/slurm_connector.py | Live Slurm adapter |
| python/hpcopt/integrations/pbs_connector.py | Live PBS Pro adapter |
| python/hpcopt/integrations/feedback.py | Prediction accuracy tracker |
| python/hpcopt/integrations/metrics_exporter.py | Prometheus exporter |
| mkdocs.yml | Documentation site config |
| docs/tutorials/*.md | 8 tutorial pages |
| docs/dashboard.html | Interactive results dashboard |
Key Metrics
| Metric | Value |
|---|---|
| Simulation speedup (Rust) | 16,000–51,000× |
| Training speedup (LightGBM) | 50× |
| p95 BSLD reduction (RL vs EASY_BACKFILL) | 25.5% |
| BSLD improvement (EASY vs FIFO) | 92–99.6% |
| Unit tests | 314 passing |
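
For readers unfamiliar with the BSLD figures in the table, the sketch below uses the common textbook definition of bounded slowdown and shows how a percentage reduction between two policies is computed. The 10-second threshold and the function names are assumptions; the release notes do not say which threshold the project uses.

```python
def bounded_slowdown(wait_s: float, runtime_s: float, tau_s: float = 10.0) -> float:
    """Bounded slowdown: (wait + runtime) / max(runtime, tau), floored at 1.

    The threshold tau keeps very short jobs from dominating the metric.
    """
    return max(1.0, (wait_s + runtime_s) / max(runtime_s, tau_s))


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage reduction of a lower-is-better metric versus a baseline."""
    return 100.0 * (baseline - candidate) / baseline


# A 10-second job that waited 90 seconds has a slowdown of 10.
example_bsld = bounded_slowdown(wait_s=90.0, runtime_s=10.0)

# A 1-second job is measured against tau, so the result is clamped to 1.
short_job_bsld = bounded_slowdown(wait_s=5.0, runtime_s=1.0)

# Reduction implied by the rounded p95 BSLD values quoted in this release.
rl_vs_easy = relative_reduction(baseline=3.82, candidate=2.85)
```

The rounded values 3.82 and 2.85 give a reduction of about 25.4%; the reported 25.5% was presumably computed from unrounded measurements.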
Full Changelog: v1.2.0...v2.0.0
v1.2.0 Security, Resilience & Observability Upgrade
- feat: Implement API deprecation configuration loading and a runtime quantile prediction model. (aaa3ca6)
- feat: Implement core API endpoints, feature engineering pipeline, and comprehensive documentation for the HPC workload optimizer. (9039d7b)
- test: Add API load tests for concurrent runtime predictions and health endpoint responsiveness with p95 latency checks. (9d4157e)
- ci errors fixed (0f8c970)
- ci errors fixed (713eb35)
- ci errors fixed (5e3b50e)
Full Changelog: v1.0.0...v1.2.0
v1.0.0 HPC Scheduling Advisory Platform
v1.0.0 — Production-Ready HPC Scheduling Advisory Platform
Systems-first HPC scheduling research and engineering platform (Python + Rust) for reproducible policy evaluation under uncertainty. This release marks the first production-ready milestone with a 9.0/10 industrial-grade readiness score across 15 audit dimensions.
Highlights
- Contract-driven simulation — Deterministic replay with executable invariants, fidelity gating, and fairness/starvation bounds
- Full credibility protocol — Multi-trace orchestration with per-trace fidelity, sensitivity sweeps, and recommendation dossiers
- Kubernetes-ready — 9 production manifests (Deployment, HPA, ServiceMonitor, OTel Collector, Alertmanager)
- 13-job CI/CD pipeline — Lint, strict mypy, SAST, dependency audit, secret scan, E2E smoke test, cross-language parity, coverage gate, OpenAPI compat, Docker build, and release gate with SBOM
- 100+ tests at 75%+ coverage — including Hypothesis property-based tests for simulation invariants, fidelity metrics, adapter contracts, and recommendation engine
What's New (since 0.1.0)
Core Platform
- Multi-format ingestion (SWF, Slurm, PBS/Torque) with canonical parquet export
- Runtime quantile modeling (p10/p50/p90) with monotonic enforcement and fallback semantics
- Resource-fit modeling (fragmentation classifier + node size regressor)
- Deterministic simulation for FIFO_STRICT, EASY_BACKFILL_BASELINE, and ML_BACKFILL_P50
- Recommendation engine with single-objective and Pareto multi-objective modes
- Benchmark suite with regression gate and history tracking
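
Independently trained quantile models can occasionally produce crossed quantiles, for example a p50 below the p10. One simple repair is a running maximum, sketched below together with a hypothetical fallback to the requested walltime when the model returns nothing usable. Both functions are illustrative assumptions; the release notes do not specify how the project implements monotonic enforcement or fallback.

```python
import math
from typing import Optional, Tuple

Quantiles = Tuple[float, float, float]


def enforce_monotonic(p10: float, p50: float, p90: float) -> Quantiles:
    """Repair quantile crossing with a running maximum so p10 <= p50 <= p90."""
    p50 = max(p50, p10)
    p90 = max(p90, p50)
    return p10, p50, p90


def predict_with_fallback(raw: Optional[Quantiles], requested_s: float) -> Quantiles:
    """Hypothetical fallback: use the requested walltime if the output is unusable."""
    if raw is None or any((not math.isfinite(v)) or v < 0 for v in raw):
        return requested_s, requested_s, requested_s
    return enforce_monotonic(*raw)


# Crossed quantiles: the p50 (80 s) is below the p10 (100 s).
repaired = enforce_monotonic(100.0, 80.0, 150.0)

# No model output at all: fall back to the user-requested limit.
fallback = predict_with_fallback(None, requested_s=3600.0)
```

A running maximum leaves already ordered predictions untouched. Sorting the three values is an alternative repair that redistributes the values instead of lifting them.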
Production Infrastructure
- FastAPI with per-endpoint rate limiting, request draining, and deprecation headers (RFC 8594/9745)
- Prometheus metrics + Grafana dashboard (8 panels)
- OpenTelemetry distributed tracing with per-environment sampling
- SLOs (99.5% availability), error budgets, and Alertmanager routing (PagerDuty + Slack)
- File-based API key auth with 3-tier loading and hot rotation
- Docker (multi-stage, pinned digests, secrets mount) + Kubernetes manifests
Engineering Quality
- Strict mypy (`disallow_untyped_defs = true`) with PEP 561 marker
- Property-based tests (Hypothesis) across 5 core modules
- Cross-language adapter parity testing (Python ↔ Rust)
- 10 JSON schemas locked with `additionalProperties: false`
- Immutable run manifests with hashes, seeds, and environment fingerprints
- Pre-commit hooks (ruff, bandit, mypy, hadolint)
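
To make the idea of a run manifest concrete, here is a minimal sketch that records input hashes, the random seed, and a small environment fingerprint, then seals the result with a hash over its canonical JSON form. The field names and the `build_manifest` helper are hypothetical and do not reflect the project's actual manifest schema.

```python
import hashlib
import json
import platform
import sys
from types import MappingProxyType
from typing import Dict, Mapping


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(inputs: Dict[str, bytes], seed: int) -> Mapping[str, object]:
    """Build a manifest whose hash changes if any input, seed, or env field changes."""
    body: Dict[str, object] = {
        "seed": seed,
        "input_hashes": {name: sha256_hex(data) for name, data in sorted(inputs.items())},
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
            "machine": platform.machine(),
        },
    }
    # Canonical JSON (sorted keys, fixed separators) makes the hash reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    body["manifest_hash"] = sha256_hex(canonical)
    # Read-only view of the top level; nested dicts would need the same treatment.
    return MappingProxyType(body)


manifest_a = build_manifest({"trace": b"abc"}, seed=42)
manifest_b = build_manifest({"trace": b"abc"}, seed=42)
manifest_c = build_manifest({"trace": b"abc"}, seed=43)
```

Two runs with identical inputs, seed, and environment produce the same `manifest_hash`, so a mismatch signals that something about the run changed.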
Operations
- 32 documentation files: 10-chapter technical docs, 9 ops guides, 5 runbooks, 3 security docs
- Disaster recovery drill script
- Drift monitoring with PSI + metric degradation tracking
- Artifact retention with production-model protection
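
The Population Stability Index (PSI) compares how a feature is distributed in a reference window versus a current window. The sketch below uses equal-width bins over the reference range and a small floor for empty bins; the binning scheme, the floor value, and the function names are assumptions and not necessarily what the project's drift monitor does.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Share of values per bin; out-of-range values are clamped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    n = len(values)
    # Floor empty bins at eps so the logarithm below stays finite.
    return [max(c / n, eps) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float], n_bins: int = 10) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # degenerate reference: all values identical
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical runtimes: the current window is shifted upward by 50 units.
reference_window = [float(v) for v in range(100)]
shifted_window = [v + 50.0 for v in reference_window]

psi_same = psi(reference_window, reference_window)
psi_shifted = psi(reference_window, shifted_window)
```

A widely used rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable thresholds depend on the workload.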
Quick Start
pip install -e ".[dev]"
hpcopt ingest swf --input data/raw/trace.swf.gz --dataset-id my_trace --out data/curated
hpcopt simulate replay-baselines --trace data/curated/my_trace.parquet --capacity-cpus 64