research(adding-model-support): iter-1..6 skill snapshot (producer/reviewer methodology) by ssss141414 · Pull Request #935 · microsoft/winml-cli

ssss141414 · 2026-06-23T03:09:20Z

research(adding-model-support): iter-1..6 skill snapshot

Adds the producer/reviewer skill for shipping new HuggingFace model recipes
into the WinML-ModelKit catalog. Two-agent flow (Producer SKILL.md + Reviewer
REVIEW.md) to defend against single-agent self-grading (_meta-005, _meta-006,
_meta-007).

Layout:

- SKILL.md: producer guide, Steps 0-7 (Step 7 = ship the PR)
- REVIEW.md: reviewer's fail-closed checklist
- model_knowledge/<family>.json: per-arch validated findings (bart, marian,
  vision-encoder-decoder, m2m_100, pix2struct, depth_pro, mgp_str, vilt)
- skill_meta/findings.json: 33 methodology findings (_meta-001 .. _meta-033)
  covering Optimum-coverage probe, composite gate, external-data layout,
  L3 verdict triage, methodology-evolution contract, PR-shipment lanes
- iter5_summary.md, iter6_summary.md: batch retrospectives
- iter6_reports/: PR-description mirrors for bart-large-mnli + vit-gpt2

Outcome contract (per _meta-031, _meta-032, _meta-033):

- Every contribution declares (Effort, Goal, Outcome) tier
- Every Outcome ships a structured 9-item contribution report = PR description
- Every Outcome ships a real github.com PR via Step 7 Lane B (branch off main)
- Skill-level updates push to this working branch (Lane A); model PRs branch
  off main with scope-per-Effort-tier discipline

Iter-6 first L3 PASS: bart-large-mnli accuracy=0.88 on glue/mnli/100 CPU.

This branch is the methodology workbench; per-model recipes ship via their own
narrow PRs off main (shzhen/add-bart-large-mnli-recipe,
shzhen/add-vit-gpt2-image-captioning-recipe).

Adds research/autoconfig/, an automated config search POC that sweeps opset versions (17-21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/ with per-model benchmark results and cross-model pattern analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
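A minimal sketch of the session-level averaging that npu-007 calls for, assuming each session contributes one median (p50) latency and the sessions are then combined with a plain mean. The names `run_once`, `session_p50s`, and `aggregate` are hypothetical and are not taken from the autoconfig scripts.

```python
import statistics
from typing import Callable, Dict, List


def session_p50s(run_once: Callable[[], float],
                 sessions: int = 3,
                 iters: int = 500) -> List[float]:
    """Return one median latency (ms) per benchmark session."""
    p50s = []
    for _ in range(sessions):
        # Assumption: a real harness would build a fresh inference session
        # here, so each session samples a different DVFS/thermal state.
        samples = [run_once() for _ in range(iters)]
        p50s.append(statistics.median(samples))
    return p50s


def aggregate(p50s: List[float]) -> Dict[str, float]:
    """Combine per-session medians into a mean plus the observed range."""
    return {
        "mean_ms": round(statistics.mean(p50s), 3),
        "min_ms": min(p50s),
        "max_ms": max(p50s),
    }
```

Keeping the per-session values next to the mean makes it possible to check later whether two configs have overlapping ranges, rather than trusting a single pooled median.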

Adds research/autoconfig/docs/agent-design.md, a strategic design for the agent layer of winml-cli, covering:
- winml-cli vs Olive distinction (UX + Windows-first + explainability)
- Why autoconfig search is a sub-tool, not the agent entry point
- 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence, Regression Detection, Model Recommendation
- Autoconfig's role within the agent framework
- Key concerns and open questions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/skills-design.md, the full design doc for the winml-cli skills/agent layer, including:
- 11 skill designs (use-winml-cli, optimize-for-device, ep-compatibility-check, debug-accuracy-drop, and others)
- Competitive analysis (Apple coremltools, ExecuTorch, AI Hub, NVIDIA ModelOpt, OpenVINO, Olive)
- Top 5 feature gaps
- Validation confidence levels (L1-L5)
- Structured output requirements
- QNN NPU catalog sweep findings (npu-001/006/007)
- FusedConv unfuse feature request

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ping skills

- Split skill catalog into two ranked categories by the 'does it touch code?' discriminator: User (config-only) and Contributor (code changes)
- Merge overlapping skills (12 -> 9):
  - check-model-feasibility = find-a-model + ep-compatibility-check
  - ship-to-winapp = validate-before-ship + prepare-for-winapp
  - autoconfig absorbs optimize-for-device as its manual mode
- Add self-contained HTML render of the design doc for easier reading

Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE. The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass mechanism does not apply. The speedup for DINOv2/MobileViT is empirically real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list: sub-ms model, 3-session ranges span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms; actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only' (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved'). Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is not a valid comparison. Finding is trivially true (use right tool for right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change: this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid' to reflect actual validated models (DINOv2 is attention-dominant, not Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eneral ViT

Run validation_sweep.py across 3 new models to rigorously test npu-001 (opset21 speedup) and npu-006 (conv fusion regression) hypotheses.

KEY FINDINGS:

npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms)
  3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms), QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the speedup is NOT a general ViT property; DINOv2-specific op patterns must explain the difference

Combined with original catalog data:
  dinov2-small +30.6%, dinov2-base +24.1% (both confirmed)
  dino-vitb16 NEUTRAL (confirmed control)
  -> scope is DINOv2 family

npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse

Combined with original resnet-18 +4900% -> hazard is conv-density-gated

Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx

Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family, validated_models expanded with dino-vitb16 (negative control) and dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs
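The first script fix can be illustrated with a small sketch, assuming the benchmark report nests the median under `latency_ms` and `p50` as the commit message and the diff fragment suggest. The helper name `parse_p50` and the sample values are hypothetical.

```python
from typing import Optional


def parse_p50(report: dict) -> Optional[float]:
    """Return the p50 latency in ms from a nested report, or None if absent."""
    lat = report.get("latency_ms")
    # Guard against a missing or non-dict "latency_ms" entry.
    p50 = lat.get("p50") if isinstance(lat, dict) else None
    if isinstance(p50, (int, float)):
        return round(p50, 3)
    return None


# Illustrative report; only the latency_ms/p50 nesting comes from the commit.
nested = {"latency_ms": {"p50": 26.2304, "p90": 27.1}}

print(nested.get("p50_ms"))  # flat lookup finds nothing: None
print(parse_p50(nested))     # nested lookup: 26.23
```

A flat lookup that silently returns `None` is easy to miss in a sweep, because the run still completes and only the recorded latency is absent.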

…nism invalidated, confidence calibrated

Merge structural improvements from local review into KB (smart merge, preserving validation sweep data from 2026-06-16):

npu-001:
- Add mechanism_invalidation field (explicit statement of INVALIDATION with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply)
- Add critical_caveats array (4 caveats incl. DINOv2-specific scope note)
- Downgrade confidence to 'medium-high on empirical / low on mechanism' (was 'high' which was overclaiming given unknown mechanism)

npu-002/003:
- Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet)

npu-004:
- Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running eval first' (was 'Treat as potentially risky' which was still prescriptive without data)

search_space_rules:
- Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual to match local review terminology

NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL, rad-dino CPU-bound) from 2026-06-16 is preserved, not overwritten.

…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
- opset17 optimized: 391 total, 49 Transpose, 121 Reshape
- opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
- opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
- opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\tmp env), reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added transpose_analysis_2026_06_16 section with raw op counts.
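A minimal sketch of the op-type count described above, assuming the models are read with `onnx.load()` and only top-level graph nodes are counted. The helper names are hypothetical, and the sample counts are the Transpose and Reshape figures quoted in this commit message for the optimized builds.

```python
from collections import Counter


def op_counts(nodes) -> Counter:
    """Count nodes by op_type; works on any objects exposing .op_type."""
    return Counter(node.op_type for node in nodes)


def load_op_counts(path: str) -> Counter:
    """Load an ONNX file and count its top-level graph nodes by op type."""
    import onnx  # imported lazily so the helpers above work without onnx
    return op_counts(onnx.load(path).graph.node)


def count_delta(before, after) -> dict:
    """Return {op_type: after - before} for op types whose count changed."""
    keys = set(before) | set(after)
    return {k: after.get(k, 0) - before.get(k, 0)
            for k in sorted(keys)
            if after.get(k, 0) != before.get(k, 0)}


# Figures quoted in the commit message (dinov2-small, optimized.onnx).
opset17 = Counter({"Transpose": 49, "Reshape": 121})
opset21 = Counter({"Transpose": 49, "Reshape": 169})
print(count_delta(opset17, opset21))  # {'Reshape': 48}
```

Diffing the two counters rather than eyeballing totals makes it clear which op types account for the difference between builds.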

…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
  h0=10.617ms [10.52, 10.32, 11.01]
  h3=9.840ms [10.25, 9.33, 9.94]
  opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
  Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
  Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
  h0=14.326ms [14.15, 14.94, 13.89]
  h3=13.830ms [13.70, 13.92, 13.87]
  opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)

CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have identical Transpose node counts (49). The speedup mechanism is NOT Transpose elimination. The effect is specific to DINOv2 family at a level below op-count visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.

Also updated: models_tested list (+5 entries), validated_models sections, scope and confidence statements, task completion notes in follow_up_required.
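The gain and range-overlap figures above can be reproduced from the bracketed per-session values. This is a sketch that assumes the headline latency is the mean of the three session values and that "overlap" means the min-max ranges intersect. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def gain_pct(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Percent latency reduction of candidate relative to baseline."""
    b, c = mean(baseline), mean(candidate)
    return round((b - c) / b * 100, 1)


def ranges_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the [min, max] ranges of the two session sets intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


# Per-session values quoted in the commit message (h0 = opset17, h3 = opset21).
bge_h0, bge_h3 = [10.52, 10.32, 11.01], [10.25, 9.33, 9.94]
vit_h0, vit_h3 = [14.15, 14.94, 13.89], [13.70, 13.92, 13.87]

print(gain_pct(bge_h0, bge_h3), ranges_overlap(bge_h0, bge_h3))  # 7.3 False
print(gain_pct(vit_h0, vit_h3), ranges_overlap(vit_h0, vit_h3))  # 3.5 True
```

For bge-small the ranges are separated by only 0.07 ms (10.32 vs 10.25), which matches the "barely non-overlapping" wording and the call for more sessions.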


…y, no PR-to-main)

Iter-6 closed PR #935 (Lane A skill snapshot → main) confirmed the over-application of `gh pr create`. Lane A is push-to-working-branch ONLY; opening a PR against `main` for skill-only changes is now an explicit anti-pattern in SKILL.md Step 7 and `_meta-033`.

Files: SKILL.md (Step 7 Lane A bullets) + skill_meta/findings.json (`_meta-033`: new gotcha, mechanism_confirmed=partial, resolution updated).

+import json
+
+
+results = json.load(open(r"ablation-search\results.json"))
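The page does not say what the code scanner flagged on this fragment. Assuming the concern is the file handle left open by `json.load(open(...))`, a context-managed variant might look like the sketch below. `load_results` is a hypothetical helper, not part of the PR.

```python
import json
from pathlib import Path


def load_results(path):
    """Load a JSON results file, closing the handle even if parsing fails."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical usage; pathlib joins the parts with the platform's separator,
# so the hard-coded backslash in the original path is not needed:
# results = load_results(Path("ablation-search") / "results.json")
```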


+        if complete_models:
+            print(f"  [reuse] existing build in {hyp_dir.name}", flush=True)
+            ok = True
+            build_out = "(reused)"


+            p50 = lat.get("p50") if isinstance(lat, dict) else None
+            if p50:
+                p50s.append(round(p50, 3))
+        except Exception:


github-actions Bot and others added 10 commits June 15, 2026 10:29

ssss141414 requested a review from a team as a code owner June 23, 2026 03:09

ssss141414 closed this Jun 23, 2026

github-advanced-security AI found potential problems Jun 23, 2026

ssss141414 wants to merge 10 commits into main from shzhen/skills_poc


2 participants
