feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report by DingmaomaoBJTU · Pull Request #599 · microsoft/winml-cli

DingmaomaoBJTU · 2026-05-12T08:16:37Z

Summary

SA eval pipeline rewrite: uses winml config + winml build for model generation (export, optimize, quantize, compile), replacing the old subprocess-based export + Python optimize approach
4-stage perf benchmarking: winml perf runs on export, optimized, quantized, and compiled ONNX models; each perf column shows gain% vs the previous stage
SA false alarm detection: compares SA predictions against EPContext ground truth to detect UNSUPPORTED and PARTIAL false alarms (ops that SA predicted as unsupported or only partially supported but the EP actually handled)
Non-NPU support: automatically skips quantize for non-NPU devices; adds stage_compile_post to compile optimized.onnx directly when build skips compile
Per-stage node counts: reports graph node count for each pipeline stage (export → graph_optimized → optimized → quantized → compiled)
--registry flag: accepts model registry JSON (same format as run_eval.py), with --priority, --task, --group, --model-type filters (default: P0+P1+P2)
EP alias support: --ep accepts both short alias (qnn, dml, cpu) and full ORT name (QNNExecutionProvider)
--report-only flag: regenerates the HTML from existing per-model sa_eval_result.json files without re-running any eval stages
--cleanup fix: now removes all non-JSON files including extensionless ONNX external data and recurses subdirectories
gen_eval_model.py: new standalone script to batch-export HuggingFace models to ONNX via winml export, with the same config/task parameters as run_eval.py
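
The chained gain% described in the 4-stage perf item can be sketched roughly as below. This is only an illustration, not the code in sa_report.py: the function name `chained_gain_pct`, the stage keys, the assumption that perf is a latency in milliseconds (lower is better), and the choice to compare against the last stage that actually has a measurement are all hypothetical.

```python
from typing import Optional

# Assumed stage keys, in pipeline order (hypothetical names).
STAGES = ["export", "optimized", "quantized", "compiled"]


def chained_gain_pct(latency_ms: dict[str, Optional[float]]) -> dict[str, Optional[float]]:
    """Return the gain% of each stage relative to the previous measured stage."""
    gains: dict[str, Optional[float]] = {}
    prev: Optional[float] = None
    for stage in STAGES:
        cur = latency_ms.get(stage)
        if cur is None:
            # Stage was skipped (e.g. quantize on a non-NPU device): no gain to report.
            gains[stage] = None
            continue
        if prev is None or prev == 0:
            # First measured stage, or an unusable baseline: nothing to compare against.
            gains[stage] = None
        else:
            # Positive means this stage is faster than the previous measured one.
            gains[stage] = (prev - cur) / prev * 100.0
        prev = cur
    return gains


if __name__ == "__main__":
    print(chained_gain_pct({"export": 10.0, "optimized": 8.0, "quantized": None, "compiled": 4.0}))
```

Note that the explicit `is None` checks keep a measured 0.0 distinct from a missing measurement.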

Review comment fixes

Reverted quantize.py changes (deferred to PR #608, "Quantize: Implement functional e2e test cases and fix issues found during test")
Fixed redundant import json in sa_report.py (CodeQL)
Replaced all wmk references with winml in comments/docstrings
Added _resolve_ep_arg() for fail-fast on unknown EPs
Fixed _final_perf using `or`, which treats 0.0 as falsy
Guarded get_optimization_config when ep_found=False
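
The `or` pitfall behind the _final_perf fix can be reproduced in a few lines. The sketch below is not the actual sa_report.py code; the function names and the two-argument shape are hypothetical and only illustrate why `or` is unsafe when 0.0 is a legitimate value.

```python
from typing import Optional


def final_perf_buggy(quantized: Optional[float], optimized: Optional[float]) -> Optional[float]:
    # `or` falls through on ANY falsy value, so a measured 0.0 is silently discarded.
    return quantized or optimized


def final_perf_fixed(quantized: Optional[float], optimized: Optional[float]) -> Optional[float]:
    # Fall back only when the value is actually missing.
    return quantized if quantized is not None else optimized


if __name__ == "__main__":
    print(final_perf_buggy(0.0, 5.0))  # falls back to the optimized value
    print(final_perf_fixed(0.0, 5.0))  # keeps the measured 0.0
```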

Usage

# Full eval on NPU with QNN
uv run python scripts/e2e_eval/run_sa_eval.py --registry scripts/e2e_eval/testsets/models_all.json \
  --ep qnn --device npu --use-cache --cleanup

# Single model on GPU (quantize auto-skipped)
uv run python scripts/e2e_eval/run_sa_eval.py --model google/vit-base-patch16-224 --ep qnn --device gpu

# Regenerate report only
uv run python scripts/e2e_eval/run_sa_eval.py --report-only --output-dir sa_eval_results/2026-06-08

# Batch export models
uv run python scripts/e2e_eval/gen_eval_model.py --model microsoft/resnet-50
uv run python scripts/e2e_eval/gen_eval_model.py --priority P0 --task image-classification

🤖 Generated with Claude Code

…rst HTML report

- run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize, and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup, --no-quantize, --quantize-precision, --quantize-samples, --report-only flags; fix output_dir to resolve to absolute path; downgrade empty SA classification from fatal error to warning
- sa_comparison.py: warn instead of silently returning empty results when SA produces no EP results (missing parquet rule data)
- sa_report.py: reorganize table columns in workflow order (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta); chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint for report-only refresh
- quantize.py: add --model-name CLI option so task-aware calibration can load the correct HuggingFace tokenizer/processor

…r each model

- Default sort by perf gain descending (Unlocked models float to top)
- Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count
- Reorder summary cards to show perf gain metrics first
- Unlocked badge: compact purple pill style '⚡ Unlocked · Xms'
- Hide models without quantize perf from main table
- Add footer showing quantized vs total complete model counts
- Rename report title to 'WinML CLI Component Analysis Report'
- Remove Regressed summary card

Replace 6 manual wmk stages with winml config + winml build:
- Stage 1: winml config → build_config.json (export/quant/compile settings)
- Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx, compiled.onnx, winml_build_config.json
- Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API)
- Stage 4: SA post-check on optimized.onnx
- Stage 5: EPContext diff on compiled.onnx (produced by build)

Read SA optimization flags from winml_build_config.json['optim'] instead of computing them via SA API. Result schema is backward-compatible with sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx).

Add --no-compile flag; remove unused run_wmk_export helper.

…peline

- Revert quantize.py changes (deferred to PR #608)
- Fix redundant json import in sa_report.py (CodeQL)
- Replace wmk references with winml in comments/docstrings
- Add _resolve_ep_arg() for fail-fast on unknown EPs
- Add SA false alarm detection (UNSUPPORTED/PARTIAL ops that EP handles)
- Add stage_compile_post for non-NPU devices missing compiled.onnx
- Skip quantize entirely for non-NPU devices
- Show per-stage ONNX node counts in report
- Fix Per-Model table filtering and perf gain fallback
- Add --registry flag for run_eval.py compatibility

Standalone script to export HuggingFace models to ONNX using winml export, with the same config/task parameters as run_eval.py for identical outputs. Supports --registry, --model, --task, --priority filtering, composite models, and skip-if-exists caching.

… run_sa_eval.py Matches run_eval.py behavior: default priority P0+P1+P2, P3 excluded. Filters only apply when --registry is used.

…s, add Correct column

SA Comparison tab now shows:
- Correct: TN ops (SA correctly detected PARTIAL/UNSUPPORTED)
- Partial FA: SA said PARTIAL but EP fully handled (less severe)
- Unsup FA: SA said UNSUPPORTED but EP handled (more severe)
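
The three buckets in the SA Comparison tab could be derived along these lines. This is a rough sketch, not the logic in sa_comparison.py: the function names, the "SUPPORTED" label, the bucket keys, and the idea that the EPContext ground truth reduces to a per-op boolean are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def classify_op(sa_prediction: str, ep_handled: bool) -> str:
    """Bucket one op given the SA prediction and whether the EP actually handled it."""
    if sa_prediction == "SUPPORTED":
        return "not_flagged"  # SA raised no concern, so there is nothing to grade here
    if sa_prediction not in ("PARTIAL", "UNSUPPORTED"):
        raise ValueError(f"unexpected SA prediction: {sa_prediction!r}")
    if not ep_handled:
        return "correct"  # SA flagged the op and the EP indeed did not take it
    # SA flagged the op but the EP handled it anyway: a false alarm.
    return "partial_fa" if sa_prediction == "PARTIAL" else "unsup_fa"


def summarize(ops: Iterable[tuple[str, bool]]) -> Counter:
    """Count ops per bucket from (sa_prediction, ep_handled) pairs."""
    return Counter(classify_op(pred, handled) for pred, handled in ops)


if __name__ == "__main__":
    sample = [("PARTIAL", True), ("UNSUPPORTED", True), ("UNSUPPORTED", False), ("SUPPORTED", True)]
    print(summarize(sample))
```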

Extensionless ONNX external data files (weight tensors) were not being cleaned up, causing 10+ GB disk usage per large model.
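
A cleanup that catches extensionless external-data files has to filter on "is not JSON" rather than on a list of known ONNX suffixes. The sketch below shows that idea under stated assumptions: `remove_non_json_files` is a hypothetical name, not the script's cleanup_onnx_artifacts, and swallowing `OSError` is one possible best-effort policy.

```python
from pathlib import Path


def remove_non_json_files(root: Path) -> int:
    """Recursively delete every file under root except *.json; return how many were removed."""
    removed = 0
    # Materialize the listing first so deletions do not interfere with the directory walk.
    for path in list(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() == ".json":
            continue  # keep directories and JSON results
        try:
            path.unlink()
            removed += 1
        except OSError:
            # Best effort: a locked or already-deleted file should not abort the whole run.
            pass
    return removed
```

Because the filter is suffix-agnostic, files such as an extensionless weight blob or a `.data` sidecar are removed along with the `.onnx` files, while result JSON in any subdirectory survives.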

DingmaomaoBJTU

Thorough review of the SA eval pipeline. Found several correctness bugs and edge-case gaps — details in the inline comments below.

- gen_eval_model: handle TimeoutExpired on proc.wait after kill
- gen_eval_model: fix cache-skip for composite models (export_*.onnx)
- run_sa_eval: cleanup_onnx_artifacts recurses subdirs, handles OSError
- run_sa_eval: fix stage_sa_pre docstring (graph_optimized, not export)
- run_sa_eval: --report-only respects --registry for report metadata
- sa_comparison: guard get_optimization_config when ep_found=False

…ation

- Add explanatory comments to empty except clauses (CodeQL)
- Build cache now validates quantized/compiled artifacts when requested

--ep now accepts both short alias (qnn, dml) and full ORT name (QNNExecutionProvider), matching run_eval.py behavior.
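
Alias resolution with fail-fast behavior might look like the following. This is a minimal sketch and not the PR's _resolve_ep_arg(): the table contents, the function name `resolve_ep`, the case-insensitive matching, and the choice of `ValueError` are assumptions. The three full names are standard ONNX Runtime provider names.

```python
# Assumed alias table; the real script may support more providers.
EP_ALIASES = {
    "qnn": "QNNExecutionProvider",
    "dml": "DmlExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def resolve_ep(arg: str) -> str:
    """Map a short alias or a full ORT provider name to the full name, or fail fast."""
    key = arg.strip().lower()
    if key in EP_ALIASES:
        return EP_ALIASES[key]
    for full_name in EP_ALIASES.values():
        if key == full_name.lower():
            return full_name
    known = sorted(EP_ALIASES) + sorted(EP_ALIASES.values())
    raise ValueError(f"Unknown execution provider {arg!r}; expected one of: {', '.join(known)}")


if __name__ == "__main__":
    print(resolve_ep("qnn"))
    print(resolve_ep("QNNExecutionProvider"))
```

Raising before any export or build work starts means a typo in --ep costs seconds rather than a full pipeline run.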

timenick

Three minor observations on the new pipeline — all low-severity.

🤖 Generated with GitHub Copilot CLI

xieofxie · 2026-06-09T05:05:14Z

🟠 stage_build cache logic — misleading log + likely incomplete rebuild

scripts/e2e_eval/run_sa_eval.py:682-690:

if use_cache and is_cached(export_path) and is_cached(optimized_path):
    if run_quantize and not is_cached(model_dir / "quantized.onnx"):
        safe_print("  [Build] Cache incomplete (missing quantized.onnx), rebuilding...")
    elif run_compile and not is_cached(model_dir / "compiled.onnx"):
        safe_print("  [Build] Cache incomplete (missing compiled.onnx), rebuilding...")
    else:
        ...
        return None

The "rebuilding..." log falls through to the build path, but --rebuild is only passed when not use_cache (line 731). So we tell the user we're rebuilding while in fact calling winml build without --rebuild — the behavior depends on whatever winml build does when artifacts exist. Either pass --rebuild here, or change the log to say "resuming build for missing artifacts."
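
One way to make the log and the behavior agree is to compute a single explicit decision before calling winml build. The sketch below is an assumption-heavy illustration, not the code in run_sa_eval.py: `artifact_exists` stands in for the script's is_cached (whose implementation is not shown here), `plan_build` and its three return values are hypothetical, and treating a missing quantized/compiled artifact as a forced rebuild is just one of the two options suggested above.

```python
from pathlib import Path


def artifact_exists(path: Path) -> bool:
    # Hypothetical stand-in for is_cached(): the file is present and non-empty.
    return path.is_file() and path.stat().st_size > 0


def plan_build(model_dir: Path, use_cache: bool, run_quantize: bool, run_compile: bool) -> str:
    """Return "skip", "build", or "rebuild" for one model directory."""
    if not use_cache:
        return "rebuild"  # caching disabled: always pass --rebuild
    base = ("export.onnx", "optimized.onnx")
    if not all(artifact_exists(model_dir / name) for name in base):
        return "build"  # no usable cache yet: plain build, as in the quoted code
    requested = []
    if run_quantize:
        requested.append("quantized.onnx")
    if run_compile:
        requested.append("compiled.onnx")
    if any(not artifact_exists(model_dir / name) for name in requested):
        return "rebuild"  # cache incomplete: force --rebuild so the log is truthful
    return "skip"
```

The caller can then log and build from the same value, for example passing --rebuild only when the result is "rebuild".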

- sa_report: fix _final_perf using `or` which treats 0.0 as falsy
- sa_report: show red for negative avg perf gain (regression)
- run_sa_eval: log message when non-NPU auto-skips quantize

- Pass --rebuild when cache is incomplete (missing quantized/compiled)
- Change log from "rebuilding" to "resuming build" for clarity
- Updated PR description to match current implementation

DingmaomaoBJTU requested a review from a team as a code owner May 12, 2026 08:16

feat(eval): add --cleanup flag to delete intermediate ONNX files afte…

0b899f3

…r each model

xieofxie reviewed May 12, 2026

Comment thread src/winml/modelkit/commands/quantize.py

xieofxie reviewed May 12, 2026

Comment thread src/winml/modelkit/commands/quantize.py

github-advanced-security AI found potential problems May 12, 2026

Comment thread scripts/e2e_eval/sa_report.py Fixed

timenick reviewed May 13, 2026

Comment thread src/winml/modelkit/commands/quantize.py

Comment thread scripts/e2e_eval/run_sa_eval.py Outdated

Comment thread scripts/e2e_eval/run_sa_eval.py

Comment thread scripts/e2e_eval/run_sa_eval.py Outdated

DingmaomaoBJTU added 2 commits May 13, 2026 16:15

zhenchaoni reviewed May 13, 2026

Comment thread src/winml/modelkit/commands/quantize.py

DingmaomaoBJTU added 7 commits June 8, 2026 18:41

merge: resolve conflicts with main, keep winml build pipeline

52846a5

feat(e2e_eval): add --priority/--task/--group/--model-type filters to…

d0d92ae

fix(e2e_eval): cleanup_onnx_artifacts now removes all non-JSON files

913087e

feat(e2e_eval): show EP and device in SA report header

4c77843

DingmaomaoBJTU commented Jun 9, 2026

github-advanced-security AI found potential problems Jun 9, 2026

Comment thread scripts/e2e_eval/gen_eval_model.py Fixed

Comment thread scripts/e2e_eval/run_sa_eval.py Fixed

DingmaomaoBJTU added 2 commits June 9, 2026 11:19

fix(e2e_eval): fix CodeQL empty-except warnings and build cache valid…

5f5dac7

feat(e2e_eval): support EP alias in run_sa_eval.py (qnn, dml, cpu, etc.)

a0fe22e

timenick reviewed Jun 9, 2026

Comment thread scripts/e2e_eval/sa_report.py Outdated

Comment thread scripts/e2e_eval/sa_report.py Outdated

Comment thread scripts/e2e_eval/run_sa_eval.py

DingmaomaoBJTU added 3 commits June 9, 2026 13:47

fix(e2e_eval): address 3 review comments

832166d

Merge remote-tracking branch 'origin/main' into qiowu/add_sa_eval

bf9b95c

fix(e2e_eval): fix build cache logic and update PR description

db6e79c

timenick approved these changes Jun 9, 2026

xieofxie approved these changes Jun 9, 2026

DingmaomaoBJTU enabled auto-merge (squash) June 9, 2026 06:08

DingmaomaoBJTU merged commit c6376a9 into main Jun 9, 2026
9 checks passed

DingmaomaoBJTU deleted the qiowu/add_sa_eval branch June 9, 2026 06:09

5 participants