26 Feb 17:16

ErenAri

d79a0ff

v2.0.0 Advanced ML, RL Policy Optimization & Real-World Integration

v2.0.0 — Advanced ML, RL Policy Optimization & Real-World Integration

Major release adding GPU-accelerated model training, reinforcement learning–based policy search, live HPC cluster connectors, and a full documentation site.

Performance Breakthroughs

LightGBM backend: 50× faster model training (7.6s vs 383s), GPU acceleration on NVIDIA GPUs
RL policy search: 25.5% improvement over EASY_BACKFILL (p95 BSLD 2.85 vs 3.82)
Ensemble predictor: Multi-backend auto-weighting via inverse pinball loss
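
The inverse-pinball-loss weighting mentioned above can be sketched as follows. This is a minimal illustration, not the code in ensemble.py: the function names and the backend labels ("lightgbm", "baseline") are assumptions made for the example.

```python
from typing import Dict, Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is penalised by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def inverse_loss_weights(losses: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """Weight each backend by 1 / loss, normalised so the weights sum to 1."""
    inverse = {name: 1.0 / (loss + eps) for name, loss in losses.items()}
    norm = sum(inverse.values())
    return {name: value / norm for name, value in inverse.items()}


# Hypothetical validation data: actual runtimes and two backends' p50 predictions.
actual = [100.0, 200.0]
losses = {
    "lightgbm": pinball_loss(actual, [110.0, 190.0], q=0.5),  # mean loss 5.0
    "baseline": pinball_loss(actual, [150.0, 150.0], q=0.5),  # mean loss 25.0
}
weights = inverse_loss_weights(losses)
```

With these made-up numbers the lower-loss backend receives about 0.83 of the weight and the other about 0.17.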

Real-World Integration

Slurm live connector — reads sacct/squeue, converts to canonical format, runs predictions
PBS Pro live connector — parses qstat JSON output with full job state tracking
Prediction feedback loop — JSONL persistence, interval coverage tracking, model drift detection with retraining alerts
Prometheus metrics exporter — zero-dependency /metrics endpoint, FastAPI mountable, Grafana-ready
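
As a rough illustration of the feedback loop, the sketch below appends prediction outcomes to a JSONL file and measures how often the actual runtime fell inside the predicted interval. The record fields, the p10 to p90 interval, and the tolerance are assumptions for the example and are not taken from feedback.py.

```python
import json
from pathlib import Path


def append_record(path: Path, job_id: str, p10: float, p90: float, actual: float) -> None:
    """Append one prediction outcome as a single JSON line."""
    record = {"job_id": job_id, "p10": p10, "p90": p90, "actual": actual}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def interval_coverage(path: Path) -> float:
    """Fraction of records whose actual runtime lies within [p10, p90]."""
    hits = 0
    total = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            total += 1
            if rec["p10"] <= rec["actual"] <= rec["p90"]:
                hits += 1
    return hits / total if total else float("nan")


def needs_retraining(coverage: float, nominal: float = 0.8, tolerance: float = 0.1) -> bool:
    """Flag drift when observed coverage strays from the nominal level.

    A p10 to p90 interval should cover about 80% of outcomes; the 0.1
    tolerance is an arbitrary value chosen for this example.
    """
    return abs(coverage - nominal) > tolerance
```

An append-only JSONL file keeps each write independent, so a crash mid-run loses at most the last line.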

Documentation Site

MkDocs Material theme with dark/light mode
8 tutorials: quickstart, installation, Rust engine, benchmarks, LightGBM, RL search, live integration, deployment
Interactive benchmark dashboard (docs/dashboard.html)

Code Quality

Ruff lint: 0 errors (85+ fixes applied)
Mypy type-check: 0 errors (5 type fixes)
314 unit tests passing
ruff format applied project-wide

New Files

File	Purpose
python/hpcopt/models/ensemble.py	Multi-backend ensemble predictor
python/hpcopt/simulate/rl_env.py	Gym-like RL scheduling environment
python/hpcopt/integrations/slurm_connector.py	Live Slurm adapter
python/hpcopt/integrations/pbs_connector.py	Live PBS Pro adapter
python/hpcopt/integrations/feedback.py	Prediction accuracy tracker
python/hpcopt/integrations/metrics_exporter.py	Prometheus exporter
mkdocs.yml	Documentation site config
docs/tutorials/*.md	8 tutorial pages
docs/dashboard.html	Interactive results dashboard

Key Metrics

Metric	Value
Simulation speedup (Rust)	16,000–51,000×
Training speedup (LightGBM)	50×
Scheduling quality (RL vs EASY)	+25.5%
BSLD improvement (EASY vs FIFO)	92–99.6%
Unit tests	314 passing
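
For readers unfamiliar with the BSLD figures above, here is a minimal sketch of bounded slowdown and a p95 summary. The release does not state which variant it uses; the 10-second threshold and the nearest-rank percentile are common conventions assumed here.

```python
import math
from typing import Sequence


def bounded_slowdown(wait_s: float, run_s: float, tau_s: float = 10.0) -> float:
    """Bounded slowdown: max(1, (wait + run) / max(run, tau)).

    tau_s stops very short jobs from dominating the metric; 10 s is an
    assumed default, not a value confirmed by the release notes.
    """
    return max(1.0, (wait_s + run_s) / max(run_s, tau_s))


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several percentile definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - candidate) / baseline
```

Applied to the rounded p95 values quoted for this release, `relative_improvement(3.82, 2.85)` comes out at roughly 0.254; the published 25.5% was presumably computed from unrounded values.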

Full Changelog: v1.2.0...v2.0.0

23 Feb 17:50

ErenAri

v1.2.0

aaa3ca6

v1.2.0 Security, Resilience & Observability Upgrade

feat: Implement API deprecation configuration loading and a runtime quantile prediction model. (aaa3ca6)
feat: Implement core API endpoints, feature engineering pipeline, and comprehensive documentation for the HPC workload optimizer. (9039d7b)
test: Add API load tests for concurrent runtime predictions and health endpoint responsiveness with p95 latency checks. (9d4157e)
fix: CI errors (0f8c970)
fix: CI errors (713eb35)
fix: CI errors (5e3b50e)

Full Changelog: v1.0.0...v1.2.0

22 Feb 21:14

github-actions

v1.0.0

516d274

v1.0.0 HPC Scheduling Advisory Platform

v1.0.0 — Production-Ready HPC Scheduling Advisory Platform

Systems-first HPC scheduling research and engineering platform (Python + Rust) for reproducible policy evaluation under uncertainty. This release marks the first production-ready milestone with a 9.0/10 industrial-grade readiness score across 15 audit dimensions.

Highlights

Contract-driven simulation — Deterministic replay with executable invariants, fidelity gating, and fairness/starvation bounds
Full credibility protocol — Multi-trace orchestration with per-trace fidelity, sensitivity sweeps, and recommendation dossiers
Kubernetes-ready — 9 production manifests (Deployment, HPA, ServiceMonitor, OTel Collector, Alertmanager)
13-job CI/CD pipeline — Lint, strict mypy, SAST, dependency audit, secret scan, E2E smoke test, cross-language parity, coverage gate, OpenAPI compat, Docker build, and release gate with SBOM
100+ tests at 75%+ coverage — including Hypothesis property-based tests for simulation invariants, fidelity metrics, adapter contracts, and recommendation engine

What's New (since 0.1.0)

Core Platform

Multi-format ingestion (SWF, Slurm, PBS/Torque) with canonical parquet export
Runtime quantile modeling (p10/p50/p90) with monotonic enforcement and fallback semantics
Resource-fit modeling (fragmentation classifier + node size regressor)
Deterministic simulation for FIFO_STRICT, EASY_BACKFILL_BASELINE, and ML_BACKFILL_P50
Recommendation engine with single-objective and Pareto multi-objective modes
Benchmark suite with regression gate and history tracking
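
Monotonic enforcement for quantile predictions can be done in several ways. The sketch below uses a running maximum so that p10 <= p50 <= p90 always holds; it is an assumed approach for illustration, and the release does not say which method hpcopt uses.

```python
from typing import List, Sequence


def enforce_monotonic(quantiles: Sequence[float]) -> List[float]:
    """Return quantile predictions with crossings removed.

    Input is ordered by quantile level (for example p10, p50, p90). Each
    value is raised to at least the previous one, so the output never
    decreases.
    """
    fixed: List[float] = []
    running = float("-inf")
    for value in quantiles:
        running = max(running, value)
        fixed.append(running)
    return fixed


# Independently trained quantile models can cross: here p50 < p10.
raw = [120.0, 100.0, 300.0]
clean = enforce_monotonic(raw)  # [120.0, 120.0, 300.0]
```

Sorting the three values is a common alternative; the running maximum has the property that it never lowers a lower-quantile prediction.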

Production Infrastructure

FastAPI with per-endpoint rate limiting, request draining, and deprecation headers (RFC 8594/9745)
Prometheus metrics + Grafana dashboard (8 panels)
OpenTelemetry distributed tracing with per-environment sampling
SLOs (99.5% availability), error budgets, and Alertmanager routing (PagerDuty + Slack)
File-based API key auth with 3-tier loading and hot rotation
Docker (multi-stage, pinned digests, secrets mount) + Kubernetes manifests
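
To illustrate the deprecation headers named above: RFC 9745 defines a Deprecation header carrying a structured-field date (an `@` followed by a Unix timestamp), and RFC 8594 defines a Sunset header carrying an HTTP date. The helper below is a hypothetical sketch of building both; its name and signature are not taken from the hpcopt API.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def deprecation_headers(
    deprecated_at: datetime, sunset_at: Optional[datetime] = None
) -> Dict[str, str]:
    """Build Deprecation (RFC 9745) and Sunset (RFC 8594) response headers.

    Both datetimes must be timezone-aware.
    """
    if deprecated_at.tzinfo is None or (sunset_at is not None and sunset_at.tzinfo is None):
        raise ValueError("timezone-aware datetimes are required")
    headers = {"Deprecation": f"@{int(deprecated_at.timestamp())}"}
    if sunset_at is not None:
        # HTTP-date format, always expressed in GMT.
        headers["Sunset"] = format_datetime(
            sunset_at.astimezone(timezone.utc), usegmt=True
        )
    return headers


headers = deprecation_headers(
    deprecated_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    sunset_at=datetime(2025, 7, 1, tzinfo=timezone.utc),
)
```

In a FastAPI handler the resulting dictionary could be merged into the response headers of any endpoint marked as deprecated.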

Engineering Quality

Strict mypy (disallow_untyped_defs = true) with PEP 561 marker
Property-based tests (Hypothesis) across 5 core modules
Cross-language adapter parity testing (Python ↔ Rust)
10 JSON schemas locked with additionalProperties: false
Immutable run manifests with hashes, seeds, and environment fingerprints
Pre-commit hooks (ruff, bandit, mypy, hadolint)
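
One way to picture a run manifest with hashes, seeds, and an environment fingerprint is sketched below. The field names and the choice of SHA-256 over canonical JSON are assumptions for the example, not the project's actual manifest schema.

```python
import hashlib
import json
import platform
import sys
from typing import Any, Dict


def canonical_json(obj: Any) -> str:
    """Serialise with sorted keys and fixed separators so hashes are stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def build_manifest(config: Dict[str, Any], seed: int, input_bytes: bytes) -> Dict[str, Any]:
    """Record what is needed to reproduce a run, plus a digest of the record."""
    body = {
        "config": config,
        "seed": seed,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
    }
    digest = hashlib.sha256(canonical_json(body).encode("utf-8")).hexdigest()
    return {**body, "manifest_sha256": digest}


def verify_manifest(manifest: Dict[str, Any]) -> bool:
    """Recompute the digest and compare; any edited field changes the hash."""
    body = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    expected = hashlib.sha256(canonical_json(body).encode("utf-8")).hexdigest()
    return expected == manifest["manifest_sha256"]
```

The manifest is "immutable" only in the sense that tampering is detectable: changing the seed or config after the fact makes `verify_manifest` return False.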

Operations

32 documentation files: 10-chapter technical docs, 9 ops guides, 5 runbooks, 3 security docs
Disaster recovery drill script
Drift monitoring with PSI + metric degradation tracking
Artifact retention with production-model protection
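
The Population Stability Index (PSI) used for drift monitoring compares the binned distribution of a feature at training time with its distribution in recent data. A minimal standard-library sketch follows; the bin edges and the floor applied to empty bins are assumptions, and the project's own implementation may differ.

```python
import bisect
import math
from typing import List, Sequence


def _bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float) -> List[float]:
    """Share of values per bin; edges must be sorted ascending."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    n = len(values)
    # Floor each share at eps so empty bins do not produce log(0).
    return [max(c / n, eps) for c in counts]


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i)."""
    e = _bin_fractions(expected, edges, eps)
    a = _bin_fractions(actual, edges, eps)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical runtimes: training data split 50/50 around 10 s, live data 20/80.
train = [5.0] * 50 + [15.0] * 50
live = [5.0] * 20 + [15.0] * 80
score = psi(train, live, edges=[10.0])  # about 0.416
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, though the right threshold depends on the workload.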

Quick Start

pip install -e ".[dev]"
hpcopt ingest swf --input data/raw/trace.swf.gz --dataset-id my_trace --out data/curated
hpcopt simulate replay-baselines --trace data/curated/my_trace.parquet --capacity-cpus 64

21 Feb 18:05

github-actions

v0.1.0

b4621a6

v0.1.0

Full Changelog: https://github.com/ErenAri/HCP-workload-optimizer/commits/v0.1.0

Releases: ErenAri/HCP-workload-optimizer
