Releases: ErenAri/HCP-workload-optimizer

v2.0.0 Advanced ML, RL Policy Optimization & Real-World Integration

26 Feb 17:16

Major release adding GPU-accelerated model training, reinforcement learning–based policy search, live HPC cluster connectors, and a full documentation site.

Performance Breakthroughs

  • LightGBM backend: 50× faster model training (7.6 s vs 383 s), with GPU acceleration on NVIDIA GPUs
  • RL policy search: 25.5% lower p95 bounded slowdown (BSLD) than EASY_BACKFILL (2.85 vs 3.82)
  • Ensemble predictor: Multi-backend auto-weighting via inverse pinball loss
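
The inverse-pinball-loss weighting mentioned in the ensemble bullet can be illustrated with a minimal sketch. It assumes each backend is scored by its mean pinball loss on held-out data and weighted in proportion to the reciprocal of that loss. The function names, the backend names, and the toy data are hypothetical and are not taken from python/hpcopt/models/ensemble.py.

```python
from typing import Dict, Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball (quantile) loss at quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is penalised by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def inverse_loss_weights(losses: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """Weight each backend by 1 / loss, normalised so the weights sum to 1."""
    inverse = {name: 1.0 / (loss + eps) for name, loss in losses.items()}
    norm = sum(inverse.values())
    return {name: value / norm for name, value in inverse.items()}


# Toy validation data (hypothetical): actual runtimes and two backends' p50 predictions.
actual = [100.0, 200.0, 300.0]
predictions = {
    "backend_a": [110.0, 190.0, 310.0],
    "backend_b": [150.0, 260.0, 240.0],
}
losses = {name: pinball_loss(actual, pred, q=0.5) for name, pred in predictions.items()}
weights = inverse_loss_weights(losses)
```

With this toy data the more accurate backend receives most of the weight. The small eps only guards against division by zero when a backend fits the validation set exactly.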

Real-World Integration

  • Slurm live connector — reads sacct/squeue, converts to canonical format, runs predictions
  • PBS Pro live connector — parses qstat JSON output with full job state tracking
  • Prediction feedback loop — JSONL persistence, interval coverage tracking, model drift detection with retraining alerts
  • Prometheus metrics exporter — zero-dependency /metrics endpoint, FastAPI mountable, Grafana-ready
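
As a rough illustration of the feedback loop described above, the sketch below appends one JSON object per finished job to a JSONL stream and measures how often the actual runtime fell inside the predicted p10 to p90 interval. The record fields, the function names, and the 10-point tolerance below the nominal 80% coverage are assumptions made for this example, not the behavior of python/hpcopt/integrations/feedback.py.

```python
import io
import json
from typing import IO


def append_record(fh: IO[str], job_id: str, p10: float, p90: float, actual: float) -> None:
    """Append one prediction/outcome pair as a single JSON line."""
    record = {"job_id": job_id, "p10": p10, "p90": p90, "actual": actual}
    fh.write(json.dumps(record) + "\n")


def interval_coverage(fh: IO[str]) -> float:
    """Fraction of records whose actual runtime lies within [p10, p90]."""
    hits = 0
    total = 0
    for line in fh:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        total += 1
        if rec["p10"] <= rec["actual"] <= rec["p90"]:
            hits += 1
    return hits / total if total else float("nan")


buf = io.StringIO()  # in-memory stand-in for a JSONL file on disk
append_record(buf, "job-1", 50.0, 150.0, 100.0)  # inside the interval
append_record(buf, "job-2", 50.0, 150.0, 200.0)  # outside the interval
append_record(buf, "job-3", 10.0, 30.0, 20.0)    # inside
append_record(buf, "job-4", 10.0, 30.0, 30.0)    # on the upper bound, counted as inside
buf.seek(0)

coverage = interval_coverage(buf)
# A p10-p90 interval nominally covers 80% of outcomes; flag a large shortfall.
retrain_suggested = coverage < 0.80 - 0.10
```

Coverage well below the nominal level suggests the intervals have become too narrow for the current workload, which is one possible trigger for a retraining alert.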

Documentation Site

  • MkDocs Material theme with dark/light mode
  • 8 tutorials: quickstart, installation, Rust engine, benchmarks, LightGBM, RL search, live integration, deployment
  • Interactive benchmark dashboard (docs/dashboard.html)

Code Quality

  • Ruff lint: 0 errors (85+ fixes applied)
  • Mypy type-check: 0 errors (5 type fixes)
  • 314 unit tests passing
  • ruff format applied project-wide

New Files

File                                             Purpose
python/hpcopt/models/ensemble.py                 Multi-backend ensemble predictor
python/hpcopt/simulate/rl_env.py                 Gym-like RL scheduling environment
python/hpcopt/integrations/slurm_connector.py    Live Slurm adapter
python/hpcopt/integrations/pbs_connector.py      Live PBS Pro adapter
python/hpcopt/integrations/feedback.py           Prediction accuracy tracker
python/hpcopt/integrations/metrics_exporter.py   Prometheus exporter
mkdocs.yml                                       Documentation site config
docs/tutorials/*.md                              8 tutorial pages
docs/dashboard.html                              Interactive results dashboard

Key Metrics

Metric                             Value
Simulation speedup (Rust)          16,000–51,000×
Training speedup (LightGBM)        50×
Scheduling quality (RL vs EASY)    +25.5%
BSLD improvement (EASY vs FIFO)    92–99.6%
Unit tests                         314 passing
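
For readers unfamiliar with the BSLD figures in the table, here is a minimal sketch of how a p95 bounded slowdown and a relative improvement could be computed. It assumes the common definition max(1, (wait + run) / max(run, tau)) with a 10-second threshold tau and a nearest-rank percentile; the release notes do not state which definition or percentile method the project uses, and the job data below is made up.

```python
import math
from typing import Sequence


def bounded_slowdown(wait_s: float, run_s: float, tau_s: float = 10.0) -> float:
    """Bounded slowdown; tau_s keeps very short jobs from dominating the metric."""
    return max(1.0, (wait_s + run_s) / max(run_s, tau_s))


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - candidate) / baseline


# Hypothetical (wait, run) pairs in seconds.
jobs = [(0.0, 100.0), (300.0, 100.0), (50.0, 2.0), (0.0, 1.0)]
bslds = [bounded_slowdown(wait, run) for wait, run in jobs]
p95 = percentile_nearest_rank(bslds, 95)

# Applied to the rounded p95 values reported above (3.82 for EASY_BACKFILL, 2.85 for RL).
gain = relative_improvement(3.82, 2.85)
```

Because the reported p95 values are rounded to two decimals, recomputing the gain from them gives roughly 25%, in line with the quoted figure up to rounding.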

Full Changelog: v1.2.0...v2.0.0

v1.2.0 Security, Resilience & Observability Upgrade

23 Feb 17:50

  • feat: Implement API deprecation configuration loading and a runtime quantile prediction model. (aaa3ca6)
  • feat: Implement core API endpoints, feature engineering pipeline, and comprehensive documentation for the HPC workload optimizer. (9039d7b)
  • test: Add API load tests for concurrent runtime predictions and health endpoint responsiveness with p95 latency checks. (9d4157e)
  • ci errors fixed (0f8c970)
  • ci errors fixed (713eb35)
  • ci errors fixed (5e3b50e)

Full Changelog: v1.0.0...v1.2.0

v1.0.0 HPC Scheduling Advisory Platform

22 Feb 21:14

v1.0.0 — Production-Ready HPC Scheduling Advisory Platform

Systems-first HPC scheduling research and engineering platform (Python + Rust) for reproducible policy evaluation under uncertainty. This release marks the first production-ready milestone with a 9.0/10 industrial-grade readiness score across 15 audit dimensions.

Highlights

  • Contract-driven simulation — Deterministic replay with executable invariants, fidelity gating, and fairness/starvation bounds
  • Full credibility protocol — Multi-trace orchestration with per-trace fidelity, sensitivity sweeps, and recommendation dossiers
  • Kubernetes-ready — 9 production manifests (Deployment, HPA, ServiceMonitor, OTel Collector, Alertmanager)
  • 13-job CI/CD pipeline — Lint, strict mypy, SAST, dependency audit, secret scan, E2E smoke test, cross-language parity, coverage gate, OpenAPI compat, Docker build, and release gate with SBOM
  • 100+ tests at 75%+ coverage — including Hypothesis property-based tests for simulation invariants, fidelity metrics, adapter contracts, and recommendation engine

What's New (since 0.1.0)

Core Platform

  • Multi-format ingestion (SWF, Slurm, PBS/Torque) with canonical parquet export
  • Runtime quantile modeling (p10/p50/p90) with monotonic enforcement and fallback semantics
  • Resource-fit modeling (fragmentation classifier + node size regressor)
  • Deterministic simulation for FIFO_STRICT, EASY_BACKFILL_BASELINE, and ML_BACKFILL_P50
  • Recommendation engine with single-objective and Pareto multi-objective modes
  • Benchmark suite with regression gate and history tracking
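
The monotonic enforcement and fallback semantics listed for the runtime quantile model can be sketched as follows. This is only one plausible reading: quantiles are clamped to be non-negative and non-decreasing with a running maximum, and a missing or non-finite prediction falls back to the job's requested walltime. Both function names and the fallback rule are assumptions made for illustration.

```python
import math
from typing import Optional, Tuple

Quantiles = Tuple[float, float, float]  # (p10, p50, p90) runtimes in seconds


def enforce_monotonic(p10: float, p50: float, p90: float) -> Quantiles:
    """Clamp to non-negative values and enforce p10 <= p50 <= p90."""
    p10 = max(0.0, p10)
    p50 = max(p10, p50)
    p90 = max(p50, p90)
    return (p10, p50, p90)


def quantiles_or_fallback(raw: Optional[Quantiles], requested_s: float) -> Quantiles:
    """Use the model output when usable, else fall back to the requested walltime.

    The fallback rule here is hypothetical, chosen only to make the sketch complete.
    """
    if raw is None or not all(math.isfinite(v) for v in raw):
        return (requested_s, requested_s, requested_s)
    return enforce_monotonic(*raw)


crossed = quantiles_or_fallback((120.0, 100.0, 90.0), requested_s=3600.0)
missing = quantiles_or_fallback(None, requested_s=3600.0)
broken = quantiles_or_fallback((60.0, float("nan"), 300.0), requested_s=3600.0)
```

A running maximum is the simplest repair for crossed quantiles; sorting the three values is an alternative that keeps the original spread at the cost of reassigning predictions between quantile levels.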

Production Infrastructure

  • FastAPI with per-endpoint rate limiting, request draining, and Sunset/Deprecation response headers (RFC 8594, RFC 9745)
  • Prometheus metrics + Grafana dashboard (8 panels)
  • OpenTelemetry distributed tracing with per-environment sampling
  • SLOs (99.5% availability), error budgets, and Alertmanager routing (PagerDuty + Slack)
  • File-based API key auth with 3-tier loading and hot rotation
  • Docker (multi-stage, pinned digests, secrets mount) + Kubernetes manifests
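
Prometheus scrapes metrics in a plain-text exposition format, so a metrics endpoint can in principle be produced without a client library. The sketch below renders HELP, TYPE, and sample lines for unlabeled metrics. The metric names and the render_metrics helper are hypothetical and are not the project's actual exporter.

```python
from typing import Dict, List, Tuple

# name -> (metric type, help text, value); every metric name here is hypothetical.
Metric = Tuple[str, str, float]


def render_metrics(metrics: Dict[str, Metric]) -> str:
    """Render unlabeled metrics in the Prometheus text exposition format."""
    lines: List[str] = []
    for name in sorted(metrics):
        metric_type, help_text, value = metrics[name]
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The format is line-oriented and the last line must end with a newline.
    return "\n".join(lines) + "\n"


body = render_metrics(
    {
        "hpcopt_predictions_total": ("counter", "Runtime predictions served.", 42),
        "hpcopt_interval_coverage": ("gauge", "Observed p10-p90 coverage.", 0.75),
    }
)
```

A web handler would return this string with a text/plain content type. Labels, escaping of help text, and histogram types are left out of the sketch.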

Engineering Quality

  • Strict mypy (disallow_untyped_defs = true) with PEP 561 marker
  • Property-based tests (Hypothesis) across 5 core modules
  • Cross-language adapter parity testing (Python ↔ Rust)
  • 10 JSON schemas locked with additionalProperties: false
  • Immutable run manifests with hashes, seeds, and environment fingerprints
  • Pre-commit hooks (ruff, bandit, mypy, hadolint)
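
To make the idea of an immutable run manifest concrete, here is a small sketch that records SHA-256 hashes of the inputs, the random seed, and a basic environment fingerprint, then seals the result with a hash of its canonical JSON form. The field names and the build_manifest helper are assumptions for illustration; the project's real manifest schema is not shown in these notes.

```python
import hashlib
import json
import platform
import sys
from types import MappingProxyType
from typing import Dict, Mapping


def build_manifest(inputs: Dict[str, bytes], seed: int) -> Mapping[str, object]:
    """Build a read-only manifest with input hashes, seed, and environment info."""
    body: Dict[str, object] = {
        "inputs": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
        "seed": seed,
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
    }
    # Canonical JSON (sorted keys, fixed separators) makes the digest reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["manifest_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # MappingProxyType blocks top-level mutation; nested dicts remain mutable.
    return MappingProxyType(body)


manifest = build_manifest({"trace.parquet": b"example bytes"}, seed=7)
same_run = build_manifest({"trace.parquet": b"example bytes"}, seed=7)
other_seed = build_manifest({"trace.parquet": b"example bytes"}, seed=8)
```

Two runs on the same machine with identical inputs and seed produce the same manifest digest, while changing the seed or any input byte changes it.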

Operations

  • 32 documentation files, including 10-chapter technical docs, 9 ops guides, 5 runbooks, and 3 security docs
  • Disaster recovery drill script
  • Drift monitoring with PSI + metric degradation tracking
  • Artifact retention with production-model protection
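
The PSI used for drift monitoring is the Population Stability Index, which compares the binned distribution of a feature in a reference window against a current window. The sketch below is a minimal standard-library version assuming fixed bin edges and a small floor to avoid taking the logarithm of zero. The function names, bin edges, and sample values are hypothetical.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin; edges must be sorted, giving len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    n = len(values)
    # Floor empty bins at eps so the log ratio below stays finite.
    return [max(count / n, eps) for count in counts]


def psi(reference: Sequence[float], current: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index: sum over bins of (cur - ref) * ln(cur / ref)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical requested runtimes (minutes) in a reference and a current window.
reference = [10, 20, 30, 40, 50, 60, 70, 80]
shifted = [50, 60, 70, 80, 90, 100, 110, 120]
edges = [25, 45, 65]

psi_same = psi(reference, reference, edges)
psi_shifted = psi(reference, shifted, edges)
```

A widely used rule of thumb treats a PSI above about 0.25 as a substantial shift, though any alerting threshold should be tuned per feature.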

Quick Start

    pip install -e ".[dev]"
    hpcopt ingest swf --input data/raw/trace.swf.gz --dataset-id my_trace --out data/curated
    hpcopt simulate replay-baselines --trace data/curated/my_trace.parquet --capacity-cpus 64

v0.1.0

21 Feb 18:05
