feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report #599
…rst HTML report
- run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize, and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup, --no-quantize, --quantize-precision, --quantize-samples, --report-only flags; fix output_dir to resolve to absolute path; downgrade empty SA classification from fatal error to warning
- sa_comparison.py: warn instead of silently returning empty results when SA produces no EP results (missing parquet rule data)
- sa_report.py: reorganize table columns in workflow order (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta); chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint for report-only refresh
- quantize.py: add --model-name CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor
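To illustrate what "chain-normalize perf gain% against previous stage" could look like, here is a minimal sketch. It is not the code in sa_report.py: the stage keys, the `chained_gain_pct` name, the use of latency in milliseconds, and the sign convention (positive means faster) are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical stage order; the real report columns may use different keys.
STAGES = ["export", "optimized", "quantized", "compiled"]


def chained_gain_pct(latency_ms: dict[str, Optional[float]]) -> dict[str, Optional[float]]:
    """Gain% of each stage relative to the nearest earlier stage that was measured."""
    gains: dict[str, Optional[float]] = {}
    prev: Optional[float] = None
    for stage in STAGES:
        cur = latency_ms.get(stage)
        if cur is None:
            gains[stage] = None  # stage skipped or perf not run
            continue
        if prev is None or prev == 0:
            gains[stage] = None  # no usable baseline to compare against
        else:
            # Assumed convention: positive means faster than the previous stage.
            gains[stage] = (prev - cur) / prev * 100.0
        prev = cur
    return gains


if __name__ == "__main__":
    print(chained_gain_pct({"export": 10.0, "optimized": 8.0, "compiled": 4.0}))
```

Comparing against the nearest measured stage, rather than always against export, keeps each column's gain attributable to the step that produced it.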
- Default sort by perf gain descending (Unlocked models float to top)
- Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count
- Reorder summary cards to show perf gain metrics first
- Unlocked badge: compact purple pill style '⚡ Unlocked · Xms'
- Hide models without quantize perf from main table
- Add footer showing quantized vs total complete model counts
- Rename report title to 'WinML CLI Component Analysis Report'
- Remove Regressed summary card
Replace 6 manual wmk stages with winml config + winml build:
- Stage 1: winml config → build_config.json (export/quant/compile settings)
- Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx,
  compiled.onnx, winml_build_config.json
- Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API)
- Stage 4: SA post-check on optimized.onnx
- Stage 5: EPContext diff on compiled.onnx (produced by build)
Read SA optimization flags from winml_build_config.json['optim'] instead
of computing them via SA API. Result schema is backward-compatible with
sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx).
Add --no-compile flag; remove unused run_wmk_export helper.
…peline
- Revert quantize.py changes (deferred to PR #608)
- Fix redundant json import in sa_report.py (CodeQL)
- Replace wmk references with winml in comments/docstrings
- Add _resolve_ep_arg() for fail-fast on unknown EPs
- Add SA false alarm detection (UNSUPPORTED/PARTIAL ops that EP handles)
- Add stage_compile_post for non-NPU devices missing compiled.onnx
- Skip quantize entirely for non-NPU devices
- Show per-stage ONNX node counts in report
- Fix Per-Model table filtering and perf gain fallback
- Add --registry flag for run_eval.py compatibility
Standalone script to export HuggingFace models to ONNX using winml export, with the same config/task parameters as run_eval.py for identical outputs. Supports --registry, --model, --task, --priority filtering, composite models, and skip-if-exists caching.
… run_sa_eval.py
Matches run_eval.py behavior: default priority P0+P1+P2, P3 excluded. Filters only apply when --registry is used.
…s, add Correct column
SA Comparison tab now shows:
- Correct: TN ops (SA correctly detected PARTIAL/UNSUPPORTED)
- Partial FA: SA said PARTIAL but EP fully handled (less severe)
- Unsup FA: SA said UNSUPPORTED but EP handled (more severe)
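The three buckets above can be read as a small decision table. The sketch below is only an illustration of that table: `classify_op`, its string labels, and the `not_flagged` bucket for ops that SA did not flag are assumptions, not names taken from sa_comparison.py.

```python
def classify_op(sa_verdict: str, ep_handled: bool) -> str:
    """Bucket one op by comparing the SA verdict with what the EP actually did.

    sa_verdict: the static analyzer's verdict for the op (e.g. "PARTIAL").
    ep_handled: True if the execution provider fully handled the op at runtime.
    """
    verdict = sa_verdict.upper()
    if verdict not in ("PARTIAL", "UNSUPPORTED"):
        return "not_flagged"  # assumption: anything else is outside the three columns
    if not ep_handled:
        return "correct"  # SA flagged the op and the EP indeed did not handle it
    # SA flagged the op but the EP handled it: a false alarm.
    return "partial_fa" if verdict == "PARTIAL" else "unsup_fa"


if __name__ == "__main__":
    for verdict, handled in [("PARTIAL", True), ("UNSUPPORTED", True), ("UNSUPPORTED", False)]:
        print(verdict, handled, "->", classify_op(verdict, handled))
```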
Extensionless ONNX external data files (weight tensors) were not being cleaned up, causing 10+ GB disk usage per large model.
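A suffix-based filter such as `*.onnx` or `*.data` misses extensionless weight files, which is why deleting everything that is not JSON is the safer rule. The following is a rough sketch of that rule, assuming the per-model directory holds only disposable artifacts plus JSON results; `cleanup_non_json` is a hypothetical name, not the pipeline's actual helper.

```python
from pathlib import Path


def cleanup_non_json(model_dir: Path) -> int:
    """Delete every non-JSON file under model_dir, including subdirectories.

    Returns the number of files removed. Files that cannot be removed
    (locked, permission denied) are reported and skipped.
    """
    removed = 0
    # Materialize the listing first so deletions do not disturb the traversal.
    for path in list(model_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() == ".json":
            continue
        try:
            path.unlink()
            removed += 1
        except OSError as exc:
            print(f"  [Cleanup] could not remove {path}: {exc}")
    return removed
```

Because the match is on "not `.json`" rather than on a list of known suffixes, external data files without any extension are covered as well.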
DingmaomaoBJTU
left a comment
Thorough review of the SA eval pipeline. Found several correctness bugs and edge-case gaps — details in the inline comments below.
- gen_eval_model: handle TimeoutExpired on proc.wait after kill
- gen_eval_model: fix cache-skip for composite models (export_*.onnx)
- run_sa_eval: cleanup_onnx_artifacts recurses subdirs, handles OSError
- run_sa_eval: fix stage_sa_pre docstring (graph_optimized, not export)
- run_sa_eval: --report-only respects --registry for report metadata
- sa_comparison: guard get_optimization_config when ep_found=False
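For the first item, the general pattern is that `proc.wait(timeout=...)` can raise `subprocess.TimeoutExpired` a second time, after `proc.kill()`, if the child is slow to be reaped. A minimal sketch of guarding both waits follows; `run_with_timeout` and the grace period are hypothetical and not taken from gen_eval_model.py.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float, kill_grace_s: float = 5.0) -> Optional[int]:
    """Run cmd and return its exit code, or None if it had to be killed."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # The second wait can also time out; without this guard the
            # exception would escape and abort the whole batch.
            proc.wait(timeout=kill_grace_s)
        except subprocess.TimeoutExpired:
            print(f"  [Export] pid {proc.pid} still running after kill; moving on")
        return None


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30.0))
```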
…ation
- Add explanatory comments to empty except clauses (CodeQL)
- Build cache now validates quantized/compiled artifacts when requested
--ep now accepts both short alias (qnn, dml) and full ORT name (QNNExecutionProvider), matching run_eval.py behavior.
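One way to accept both forms and still fail fast on typos is a single lookup table. This is a hedged sketch rather than the PR's `_resolve_ep_arg()`: the `resolve_ep` name, the table contents, and the case-insensitive matching are assumptions. The provider strings themselves are standard ONNX Runtime names.

```python
# Assumed alias table: short alias -> ONNX Runtime execution provider name.
_EP_ALIASES = {
    "qnn": "QNNExecutionProvider",
    "dml": "DmlExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def resolve_ep(arg: str) -> str:
    """Map a short alias or a full ORT provider name to the full name.

    Raises ValueError for anything unknown, so a typo fails before any
    model is built instead of partway through a batch.
    """
    key = arg.strip().lower()
    if key in _EP_ALIASES:
        return _EP_ALIASES[key]
    for full_name in _EP_ALIASES.values():
        if key == full_name.lower():
            return full_name
    known = ", ".join(sorted(_EP_ALIASES) + sorted(_EP_ALIASES.values()))
    raise ValueError(f"unknown EP {arg!r}; expected one of: {known}")


if __name__ == "__main__":
    print(resolve_ep("qnn"), resolve_ep("QNNExecutionProvider"))
```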
timenick
left a comment
Three minor observations on the new pipeline — all low-severity.
🤖 Generated with GitHub Copilot CLI
🟠
    if use_cache and is_cached(export_path) and is_cached(optimized_path):
        if run_quantize and not is_cached(model_dir / "quantized.onnx"):
            safe_print(" [Build] Cache incomplete (missing quantized.onnx), rebuilding...")
        elif run_compile and not is_cached(model_dir / "compiled.onnx"):
            safe_print(" [Build] Cache incomplete (missing compiled.onnx), rebuilding...")
        else:
            ...
            return None
The "rebuilding..." log falls through to the build path, but …
- sa_report: fix _final_perf using `or` which treats 0.0 as falsy
- sa_report: show red for negative avg perf gain (regression)
- run_sa_eval: log message when non-NPU auto-skips quantize
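The `or` pitfall in the first item is easy to reproduce in isolation. The two functions below are illustrative stand-ins, not the real `_final_perf`; they assume the helper picks the latest available stage measurement and falls back to an earlier one.

```python
from typing import Optional


def final_perf_buggy(quantized_ms: Optional[float], optimized_ms: Optional[float]) -> Optional[float]:
    # `or` falls back whenever the left side is falsy, and 0.0 is falsy,
    # so a legitimate 0.0 measurement is silently replaced.
    return quantized_ms or optimized_ms


def final_perf_fixed(quantized_ms: Optional[float], optimized_ms: Optional[float]) -> Optional[float]:
    # Fall back only when the value is actually missing.
    return quantized_ms if quantized_ms is not None else optimized_ms


if __name__ == "__main__":
    print(final_perf_buggy(0.0, 5.0))  # 5.0: the 0.0 measurement is lost
    print(final_perf_fixed(0.0, 5.0))  # 0.0
```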
- Pass --rebuild when cache is incomplete (missing quantized/compiled)
- Change log from "rebuilding" to "resuming build" for clarity
- Updated PR description to match current implementation
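One way to make the incomplete-cache case explicit is to return a three-way state instead of falling through after a log line. The sketch below is an assumption-laden illustration: `cache_state`, its return labels, and this `is_cached` stand-in (a non-empty file check) are not taken from run_sa_eval.py.

```python
from pathlib import Path


def is_cached(path: Path) -> bool:
    """Stand-in cache check: the artifact exists and is non-empty."""
    return path.is_file() and path.stat().st_size > 0


def cache_state(model_dir: Path, run_quantize: bool, run_compile: bool) -> str:
    """Return "hit", "resume", or "miss" for a model's build directory."""
    base_ok = is_cached(model_dir / "export.onnx") and is_cached(model_dir / "optimized.onnx")
    if not base_ok:
        return "miss"  # nothing usable: run the full build
    if run_quantize and not is_cached(model_dir / "quantized.onnx"):
        return "resume"  # requested artifact missing: build with --rebuild
    if run_compile and not is_cached(model_dir / "compiled.onnx"):
        return "resume"
    return "hit"  # every requested artifact is present: skip the build
```

A caller would then skip the build on "hit", add `--rebuild` on "resume", and run a normal build on "miss", so the log message and the action taken cannot drift apart.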
Summary
- `winml config` + `winml build` for model generation (export, optimize, quantize, compile), replacing the old subprocess-based export + Python optimize approach
- `winml perf` runs on export, optimized, quantized, and compiled ONNX models; each perf column shows gain% vs the previous stage
- `stage_compile_post` to compile optimized.onnx directly when build skips compile
- `--registry` flag: accepts model registry JSON (same format as `run_eval.py`), with `--priority`, `--task`, `--group`, `--model-type` filters (default: P0+P1+P2)
- `--ep` accepts both short alias (qnn, dml, cpu) and full ORT name (QNNExecutionProvider)
- `--report-only` flag: regenerates the HTML from existing per-model `sa_eval_result.json` files without re-running any eval stages
- `--cleanup` fix: now removes all non-JSON files including extensionless ONNX external data and recurses subdirectories
- `gen_eval_model.py`: new standalone script to batch-export HuggingFace models to ONNX via `winml export`, with identical config/task parameters as `run_eval.py`

Review comment fixes
- Revert `quantize.py` changes (deferred to PR "Quantize: Implement functional e2e test cases and fix issues found during test" #608)
- Fix redundant `import json` in `sa_report.py` (CodeQL)
- Replace `wmk` references with `winml` in comments/docstrings
- Add `_resolve_ep_arg()` for fail-fast on unknown EPs
- Fix `_final_perf` using `or` which treats `0.0` as falsy
- Guard `get_optimization_config` when `ep_found=False`

Usage
🤖 Generated with Claude Code