
feat(eval): SA eval pipeline with per-stage perf, quantize step, and workflow HTML report #599

Merged: DingmaomaoBJTU merged 17 commits into main from qiowu/add_sa_eval on Jun 9, 2026

@DingmaomaoBJTU DingmaomaoBJTU commented May 12, 2026

Collaborator

Summary

  • SA eval pipeline rewrite: uses winml config + winml build for model generation (export, optimize, quantize, compile), replacing the old subprocess-based export + Python optimize approach
  • 4-stage perf benchmarking: winml perf runs on export, optimized, quantized, and compiled ONNX models; each perf column shows gain% vs the previous stage
  • SA false alarm detection: compares SA predictions against EPContext ground truth to detect UNSUPPORTED and PARTIAL false alarms (ops that SA predicted as unsupported or only partially supported but that the EP actually handled)
  • Non-NPU support: automatically skips quantize for non-NPU devices; adds stage_compile_post to compile optimized.onnx directly when build skips compile
  • Per-stage node counts: reports graph node count for each pipeline stage (export → graph_optimized → optimized → quantized → compiled)
  • --registry flag: accepts a model registry JSON file (same format as run_eval.py), with --priority, --task, --group, --model-type filters (default: P0+P1+P2)
  • EP alias support: --ep accepts either a short alias (qnn, dml, cpu) or the full ORT name (QNNExecutionProvider)
  • --report-only flag: regenerates the HTML from existing per-model sa_eval_result.json files without re-running any eval stages
  • --cleanup fix: now removes all non-JSON files, including extensionless ONNX external data, and recurses into subdirectories
  • gen_eval_model.py: new standalone script to batch-export HuggingFace models to ONNX via winml export, with the same config/task parameters as run_eval.py

Review comment fixes

Usage

# Full eval on NPU with QNN
uv run python scripts/e2e_eval/run_sa_eval.py --registry scripts/e2e_eval/testsets/models_all.json \
  --ep qnn --device npu --use-cache --cleanup

# Single model on GPU (quantize auto-skipped)
uv run python scripts/e2e_eval/run_sa_eval.py --model google/vit-base-patch16-224 --ep qnn --device gpu

# Regenerate report only
uv run python scripts/e2e_eval/run_sa_eval.py --report-only --output-dir sa_eval_results/2026-06-08

# Batch export models
uv run python scripts/e2e_eval/gen_eval_model.py --model microsoft/resnet-50
uv run python scripts/e2e_eval/gen_eval_model.py --priority P0 --task image-classification

🤖 Generated with Claude Code

…rst HTML report

- run_sa_eval.py: run wmk perf after export, graph-optimize, SA-optimize,
  and QDQ-quantize (stage 6); add --no-perf, --perf-iterations, --perf-warmup,
  --no-quantize, --quantize-precision, --quantize-samples, --report-only flags;
  fix output_dir to resolve to absolute path; downgrade empty SA classification
  from fatal error to warning
- sa_comparison.py: warn instead of silently returning empty results when SA
  produces no EP results (missing parquet rule data)
- sa_report.py: reorganize table columns in workflow order
  (Export → Normalize → Pre SA → Flags → Optimized → Post SA → Quantize → Delta);
  chain-normalize perf gain% against previous stage; add __main__ CLI entrypoint
  for report-only refresh
- quantize.py: add --model-name CLI option so task-aware calibration can load
  the correct HuggingFace tokenizer/processor
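
As an illustration of the chained gain% described for sa_report.py, here is a minimal sketch. The stage names, the `chain_gains` helper, and the gain formula (latency reduction relative to the nearest earlier stage that has data) are assumptions made for this example and are not taken from the actual report code.

```python
from typing import Dict, Optional

# Hypothetical stage order; the real report may use different keys.
STAGES = ["export", "optimized", "quantized", "compiled"]


def chain_gains(latency_ms: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Return gain% of each stage relative to the nearest earlier stage with data.

    Positive values mean the stage is faster than its predecessor.
    Stages without a measurement (None) are skipped in the chain.
    """
    gains: Dict[str, Optional[float]] = {}
    prev: Optional[float] = None
    for stage in STAGES:
        cur = latency_ms.get(stage)
        if cur is None:
            gains[stage] = None  # no measurement, keep the previous baseline
            continue
        if prev is None or prev == 0:
            gains[stage] = None  # first measured stage has nothing to compare to
        else:
            gains[stage] = (prev - cur) / prev * 100.0
        prev = cur
    return gains


# Quantize was skipped here, so "compiled" is compared against "optimized".
example = chain_gains(
    {"export": 10.0, "optimized": 8.0, "quantized": None, "compiled": 4.0}
)
```

Skipping missing stages keeps the chain meaningful for non-NPU runs, where the quantized stage has no measurement.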
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner May 12, 2026 08:16
Comment thread src/winml/modelkit/commands/quantize.py
Comment thread src/winml/modelkit/commands/quantize.py
Comment thread scripts/e2e_eval/sa_report.py Fixed
Comment thread src/winml/modelkit/commands/quantize.py
Comment thread scripts/e2e_eval/run_sa_eval.py Outdated
Comment thread scripts/e2e_eval/run_sa_eval.py
Comment thread scripts/e2e_eval/run_sa_eval.py Outdated
- Default sort by perf gain descending (Unlocked models float to top)
- Add perf gain summary cards: Avg Perf Gain, Faster Models, Unlocked count
- Reorder summary cards to show perf gain metrics first
- Unlocked badge: compact purple pill style '⚡ Unlocked · Xms'
- Hide models without quantize perf from main table
- Add footer showing quantized vs total complete model counts
- Rename report title to 'WinML CLI Component Analysis Report'
- Remove Regressed summary card
Replace 6 manual wmk stages with winml config + winml build:
- Stage 1: winml config → build_config.json (export/quant/compile settings)
- Stage 2: winml build → export.onnx, optimized.onnx, quantized.onnx,
                         compiled.onnx, winml_build_config.json
- Stage 3: SA pre-check on export.onnx (via ONNXStaticAnalyzer Python API)
- Stage 4: SA post-check on optimized.onnx
- Stage 5: EPContext diff on compiled.onnx (produced by build)

Read SA optimization flags from winml_build_config.json['optim'] instead
of computing them via SA API. Result schema is backward-compatible with
sa_report.py (perf.graph_optimized=None, perf.sa_optimized=optimized.onnx).

Add --no-compile flag; remove unused run_wmk_export helper.
Comment thread src/winml/modelkit/commands/quantize.py
…peline

- Revert quantize.py changes (deferred to PR #608)
- Fix redundant json import in sa_report.py (CodeQL)
- Replace wmk references with winml in comments/docstrings
- Add _resolve_ep_arg() for fail-fast on unknown EPs
- Add SA false alarm detection (UNSUPPORTED/PARTIAL ops that EP handles)
- Add stage_compile_post for non-NPU devices missing compiled.onnx
- Skip quantize entirely for non-NPU devices
- Show per-stage ONNX node counts in report
- Fix Per-Model table filtering and perf gain fallback
- Add --registry flag for run_eval.py compatibility
Standalone script to export HuggingFace models to ONNX using winml export,
with the same config/task parameters as run_eval.py for identical outputs.

Supports --registry, --model, --task, --priority filtering, composite
models, and skip-if-exists caching.
… run_sa_eval.py

Matches run_eval.py behavior: default priority P0+P1+P2, P3 excluded.
Filters only apply when --registry is used.
…s, add Correct column

SA Comparison tab now shows:
- Correct: TN ops (SA correctly detected PARTIAL/UNSUPPORTED)
- Partial FA: SA said PARTIAL but EP fully handled (less severe)
- Unsup FA: SA said UNSUPPORTED but EP handled (more severe)
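
The three columns above can be read as a small classification rule. The sketch below is only an illustration: the `classify_op` helper, the `(prediction, ep_handled)` input shape, and the bucket names are hypothetical and do not reflect the actual sa_comparison.py data model.

```python
from collections import Counter

VALID_PREDICTIONS = {"SUPPORTED", "PARTIAL", "UNSUPPORTED"}


def classify_op(sa_prediction: str, ep_handled: bool) -> str:
    """Bucket one op by comparing the SA prediction with the EP ground truth."""
    if sa_prediction not in VALID_PREDICTIONS:
        raise ValueError(f"unknown SA prediction: {sa_prediction!r}")
    if sa_prediction == "SUPPORTED":
        return "not_flagged"  # SA raised no alarm, so no false alarm is possible
    if not ep_handled:
        return "correct"      # SA flagged the op and the EP did not take it
    # SA flagged the op but the EP handled it anyway: a false alarm.
    return "partial_fa" if sa_prediction == "PARTIAL" else "unsup_fa"


# Hypothetical per-op inputs: (SA prediction, did the EP handle the op?)
ops = [
    ("PARTIAL", True),
    ("UNSUPPORTED", True),
    ("UNSUPPORTED", False),
    ("SUPPORTED", True),
]
counts = Counter(classify_op(pred, handled) for pred, handled in ops)
```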
Extensionless ONNX external data files (weight tensors) were not being
cleaned up, causing 10+ GB disk usage per large model.
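
A cleanup that keeps only JSON results and also catches extensionless external-data files might look like the sketch below. The `cleanup_non_json` name and the demo file names are assumptions for illustration; the real cleanup_onnx_artifacts in run_sa_eval.py may differ.

```python
import tempfile
from pathlib import Path


def cleanup_non_json(model_dir: Path) -> int:
    """Delete every file under model_dir that is not a .json file.

    Matching on "not JSON" rather than on "*.onnx" is what catches
    extensionless external-data files. Returns the number of files removed.
    """
    removed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for path in sorted(model_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() == ".json":
            continue
        try:
            path.unlink()
            removed += 1
        except OSError:
            # For example a file still locked by another process; skip it.
            pass
    return removed


# Small demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sa_eval_result.json").write_text("{}")
    (root / "export.onnx").write_bytes(b"\0")
    (root / "sub" / "weights_blob").write_bytes(b"\0")  # extensionless data file
    removed = cleanup_non_json(root)
    remaining = sorted(p.name for p in root.rglob("*") if p.is_file())
```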

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Collaborator Author


Thorough review of the SA eval pipeline. Found several correctness bugs and edge-case gaps — details in the inline comments below.

Comment thread scripts/e2e_eval/gen_eval_model.py Outdated
Comment thread scripts/e2e_eval/gen_eval_model.py
Comment thread scripts/e2e_eval/run_sa_eval.py
Comment thread scripts/e2e_eval/run_sa_eval.py Outdated
Comment thread scripts/e2e_eval/run_sa_eval.py
Comment thread scripts/e2e_eval/run_sa_eval.py Outdated
Comment thread scripts/e2e_eval/sa_comparison.py Outdated
- gen_eval_model: handle TimeoutExpired on proc.wait after kill
- gen_eval_model: fix cache-skip for composite models (export_*.onnx)
- run_sa_eval: cleanup_onnx_artifacts recurses subdirs, handles OSError
- run_sa_eval: fix stage_sa_pre docstring (graph_optimized, not export)
- run_sa_eval: --report-only respects --registry for report metadata
- sa_comparison: guard get_optimization_config when ep_found=False
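
For the first fix in the list above, the general pattern of guarding the second `wait` after `kill` can be sketched as follows. The `run_with_timeout` helper and its parameters are hypothetical; only `subprocess.Popen`, `Popen.wait`, `Popen.kill`, and `subprocess.TimeoutExpired` are standard library APIs.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(
    cmd: List[str], timeout_s: float, kill_grace_s: float = 5.0
) -> Optional[int]:
    """Run cmd and return its exit code, or None if it had to be killed."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # The wait after kill can itself time out if the process is stuck,
            # so it needs its own handler instead of propagating.
            proc.wait(timeout=kill_grace_s)
        except subprocess.TimeoutExpired:
            pass  # give up reaping; the caller still gets a clean "timed out"
        return None


quick = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30.0)
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
)
```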
Comment thread scripts/e2e_eval/gen_eval_model.py Fixed
Comment thread scripts/e2e_eval/run_sa_eval.py Fixed
…ation

- Add explanatory comments to empty except clauses (CodeQL)
- Build cache now validates quantized/compiled artifacts when requested
--ep now accepts both short alias (qnn, dml) and full ORT name
(QNNExecutionProvider), matching run_eval.py behavior.
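
A minimal sketch of alias resolution with fail-fast behavior is shown below. The `resolve_ep` name, the alias table contents, and the case-insensitive matching are assumptions for illustration; the PR's own helper is `_resolve_ep_arg()`, whose signature is not shown here.

```python
# Short alias -> ONNX Runtime execution provider name.
EP_ALIASES = {
    "qnn": "QNNExecutionProvider",
    "dml": "DmlExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def resolve_ep(value: str) -> str:
    """Map a short alias or a full provider name to the full provider name.

    Raises ValueError for unknown values so a typo fails before any
    long-running stage starts.
    """
    key = value.strip().lower()
    if key in EP_ALIASES:
        return EP_ALIASES[key]
    for full_name in EP_ALIASES.values():
        if key == full_name.lower():
            return full_name
    known = ", ".join(sorted(EP_ALIASES) + sorted(EP_ALIASES.values()))
    raise ValueError(f"unknown execution provider {value!r}; expected one of: {known}")
```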

@timenick timenick left a comment

Collaborator


Three minor observations on the new pipeline — all low-severity.

🤖 Generated with GitHub Copilot CLI

Comment thread scripts/e2e_eval/sa_report.py Outdated
Comment thread scripts/e2e_eval/sa_report.py Outdated
Comment thread scripts/e2e_eval/run_sa_eval.py
@xieofxie

xieofxie commented Jun 9, 2026

Contributor

🟠 stage_build cache logic — misleading log + likely incomplete rebuild

scripts/e2e_eval/run_sa_eval.py:682-690:

if use_cache and is_cached(export_path) and is_cached(optimized_path):
    if run_quantize and not is_cached(model_dir / "quantized.onnx"):
        safe_print("  [Build] Cache incomplete (missing quantized.onnx), rebuilding...")
    elif run_compile and not is_cached(model_dir / "compiled.onnx"):
        safe_print("  [Build] Cache incomplete (missing compiled.onnx), rebuilding...")
    else:
        ...
        return None

The two "rebuilding..." branches fall through to the build path, but --rebuild is only passed when not use_cache (line 731). So we tell the user we're rebuilding while in fact calling winml build without --rebuild, and the outcome depends on whatever winml build does when artifacts already exist. Either pass --rebuild here, or change the log to say "resuming build for missing artifacts."
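
One way to express the first option (pass --rebuild whenever the cache is incomplete) is to separate the decision from the logging. The sketch below is hypothetical: `plan_build` and the `existing` set stand in for the real `is_cached` checks, and only the artifact names and the --rebuild flag come from the snippet above.

```python
from typing import List, Set, Tuple


def plan_build(
    existing: Set[str], use_cache: bool, run_quantize: bool, run_compile: bool
) -> Tuple[str, List[str]]:
    """Return (action, extra_args) for the build stage.

    action is "skip" when the cache already satisfies the request,
    otherwise "build". extra_args carries --rebuild when a forced
    rebuild is intended.
    """
    base_ok = {"export.onnx", "optimized.onnx"} <= existing
    optional_missing = []
    if run_quantize and "quantized.onnx" not in existing:
        optional_missing.append("quantized.onnx")
    if run_compile and "compiled.onnx" not in existing:
        optional_missing.append("compiled.onnx")

    if use_cache and base_ok and not optional_missing:
        return "skip", []
    # Force a rebuild when caching is off, or when the cache is present
    # but lacks a requested quantized/compiled artifact.
    force = (not use_cache) or (base_ok and bool(optional_missing))
    return "build", ["--rebuild"] if force else []
```

With the decision returned as data, the log message can be derived from the same value that determines the command line, so the two cannot drift apart.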

- sa_report: fix _final_perf using `or` which treats 0.0 as falsy
- sa_report: show red for negative avg perf gain (regression)
- run_sa_eval: log message when non-NPU auto-skips quantize
- Pass --rebuild when cache is incomplete (missing quantized/compiled)
- Change log from "rebuilding" to "resuming build" for clarity
- Updated PR description to match current implementation
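
The `or` pitfall named in the first item above is easy to reproduce. The sketch below assumes, for illustration only, that `_final_perf` picks the quantized value and falls back to the optimized one; the function names and arguments here are hypothetical.

```python
from typing import Optional


def final_perf_buggy(quantized: Optional[float], optimized: Optional[float]) -> Optional[float]:
    # `or` falls back whenever the left side is falsy, and 0.0 is falsy,
    # so a legitimate measurement of 0.0 is silently replaced.
    return quantized or optimized


def final_perf(quantized: Optional[float], optimized: Optional[float]) -> Optional[float]:
    # Fall back only when the value is actually missing.
    return quantized if quantized is not None else optimized
```

For example, `final_perf_buggy(0.0, 12.5)` returns the fallback 12.5, while `final_perf(0.0, 12.5)` keeps the measured 0.0.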
@DingmaomaoBJTU DingmaomaoBJTU enabled auto-merge (squash) June 9, 2026 06:08
@DingmaomaoBJTU DingmaomaoBJTU merged commit c6376a9 into main Jun 9, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the qiowu/add_sa_eval branch June 9, 2026 06:09
5 participants