research(adding-model-support): iter-1..6 skill snapshot (producer/reviewer methodology) #935

Closed
ssss141414 wants to merge 10 commits into main from shzhen/skills_poc
@ssss141414

research(adding-model-support): iter-1..6 skill snapshot

Adds the producer/reviewer skill for shipping new HuggingFace model recipes
into the WinML-ModelKit catalog. Two-agent flow (Producer SKILL.md + Reviewer
REVIEW.md) to defend against single-agent self-grading (_meta-005, _meta-006,
_meta-007).

Layout:

  • SKILL.md: producer guide, Steps 0-7 (Step 7 = ship the PR)
  • REVIEW.md: reviewer's fail-closed checklist
  • model_knowledge/<family>.json: per-arch validated findings (bart, marian,
    vision-encoder-decoder, m2m_100, pix2struct, depth_pro, mgp_str, vilt)
  • skill_meta/findings.json: 33 methodology findings (_meta-001 .. _meta-033)
    covering Optimum-coverage probe, composite gate, external-data layout,
    L3 verdict triage, methodology-evolution contract, PR-shipment lanes
  • iter5_summary.md, iter6_summary.md: batch retrospectives
  • iter6_reports/: PR-description mirrors for bart-large-mnli + vit-gpt2

Outcome contract (per _meta-031, _meta-032, _meta-033):

  • Every contribution declares (Effort, Goal, Outcome) tier
  • Every Outcome ships a structured 9-item contribution report = PR description
  • Every Outcome ships a real github.com PR via Step 7 Lane B (branch off main)
  • Skill-level updates push to this working branch (Lane A); model PRs branch
    off main with scope-per-Effort-tier discipline

Iter-6 first L3 PASS: bart-large-mnli accuracy=0.88 on glue/mnli/100 CPU.

This branch is the methodology workbench; per-model recipes ship via their own
narrow PRs off main (shzhen/add-bart-large-mnli-recipe,
shzhen/add-vit-gpt2-image-captioning-recipe).

github-actions Bot and others added 10 commits June 15, 2026 10:29
Adds research/autoconfig/ — an automated config search POC that sweeps
opset versions (17-21), execution providers, and graph optimizations to
find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/
with per-model benchmark results and cross-model pattern analysis.
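The session-level averaging that npu-007 calls for could look roughly like the sketch below. It is an illustration only, not the code used in the sweep: the function name, the per-session median, and the definition of CV as standard deviation over mean are all assumptions, and the sample values are made up.

```python
from statistics import mean, median, stdev


def summarize_sessions(sessions):
    """Aggregate latency samples (ms) collected in separate benchmark sessions.

    `sessions` is a list of lists, one inner list per session
    (for example 3 sessions of 500 iterations each).
    """
    # One robust number per session: the median is insensitive to a few
    # thermally throttled iterations inside that session.
    per_session = [median(samples) for samples in sessions]
    center = mean(per_session)
    # Coefficient of variation across sessions (assumed definition: stdev / mean).
    cv = stdev(per_session) / center if len(per_session) > 1 else 0.0
    return {"per_session_ms": per_session, "mean_ms": center, "cv": cv}


# Hypothetical samples: three short sessions, one containing a throttling spike.
demo = summarize_sessions(
    [[10.0, 10.2, 10.1], [10.3, 30.0, 10.4], [10.1, 10.0, 10.2]]
)
print(demo)
```

Reporting the per-session values alongside the aggregate keeps a single noisy session visible instead of letting it silently shift the headline number.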

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds research/autoconfig/docs/agent-design.md — strategic design for
the agent layer of winml-cli, covering:

- winml-cli vs Olive distinction (UX + Windows-first + explainability)
- Why autoconfig search is a sub-tool, not the agent entry point
- 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence,
  Regression Detection, Model Recommendation
- Autoconfig's role within the agent framework
- Key concerns and open questions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds research/autoconfig/docs/skills-design.md — full design doc for
the winml-cli skills/agent layer, including:

- 11 skill designs (use-winml-cli, optimize-for-device,
  ep-compatibility-check, debug-accuracy-drop, and others)
- Competitive analysis (Apple coremltools, ExecuTorch, AI Hub,
  NVIDIA ModelOpt, OpenVINO, Olive)
- Top 5 feature gaps
- Validation confidence levels (L1-L5)
- Structured output requirements
- QNN NPU catalog sweep findings (npu-001/006/007)
- FusedConv unfuse feature request

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ping skills

- Split skill catalog into two ranked categories by the 'does it touch code?'
  discriminator: User (config-only) and Contributor (code changes)
- Merge overlapping skills (12 -> 9):
  - check-model-feasibility = find-a-model + ep-compatibility-check
  - ship-to-winapp = validate-before-ship + prepare-for-winapp
  - autoconfig absorbs optimize-for-device as its manual mode
- Add self-contained HTML render of the design doc for easier reading
Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE
  The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used
  onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass
  mechanism does not apply. The speedup for DINOv2/MobileViT is empirically
  real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list — sub-ms model, 3-session ranges
  span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms;
  actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only'
  (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved')
  Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is
  not a valid comparison. Finding is trivially true (use right tool for
  right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change — this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual
  for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid'
  to reflect actual validated models (DINOv2 is attention-dominant, not
  Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eneral ViT

Run validation_sweep.py across 3 new models to rigorously test npu-001
(opset21 speedup) and npu-006 (conv fusion regression) hypotheses.

KEY FINDINGS:

npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms)
  3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms),
  QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the
  speedup is NOT a general ViT property; DINOv2-specific op patterns
  must explain the difference

Combined with original catalog data:
  dinov2-small +30.6%, dinov2-base +24.1% (both confirmed)
  dino-vitb16 NEUTRAL (confirmed control) -> scope is DINOv2 family

npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse
  Combined with original resnet-18 +4900% -> hazard is conv-density-gated

Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx
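A minimal sketch of what the corrected parsing and model selection might look like. The helper names `extract_p50` and `pick_model` are hypothetical, and the preference order (quantized.onnx before optimized.onnx, never export.onnx) is inferred from the fix list above rather than copied from validation_sweep.py.

```python
def extract_p50(result):
    """Return the p50 latency in ms from a benchmark result dict, or None."""
    lat = result.get("latency_ms") if isinstance(result, dict) else None
    p50 = lat.get("p50") if isinstance(lat, dict) else None
    # Only accept real numbers; bool is excluded because it subclasses int.
    if isinstance(p50, (int, float)) and not isinstance(p50, bool):
        return round(float(p50), 3)
    return None


def pick_model(available):
    """Choose the artifact to benchmark from a collection of file names.

    Assumed preference order: quantized.onnx first, then optimized.onnx.
    export.onnx is never selected because it may be an incomplete export.
    """
    for name in ("quantized.onnx", "optimized.onnx"):
        if name in available:
            return name
    return None


print(extract_p50({"latency_ms": {"p50": 26.23}}))
print(pick_model({"export.onnx", "optimized.onnx", "quantized.onnx"}))
```

A flat key such as `p50_ms` yields `None` here, so a result file in the wrong shape shows up as a missing measurement rather than a silent zero.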

Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family,
  validated_models expanded with dino-vitb16 (negative control) and
  dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs
…nism invalidated, confidence calibrated

Merge structural improvements from local review into KB (smart merge,
preserving validation sweep data from 2026-06-16):

npu-001:
- Add mechanism_invalidation field (explicit statement of INVALIDATION
  with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply)
- Add critical_caveats array (4 caveats incl. DINOv2-specific scope note)
- Downgrade confidence to 'medium-high on empirical / low on mechanism'
  (was 'high' which was overclaiming given unknown mechanism)

npu-002/003:
- Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet)

npu-004:
- Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running
  eval first' (was 'Treat as potentially risky' which was still prescriptive
  without data)

search_space_rules:
- Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual
  to match local review terminology

NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL,
rad-dino CPU-bound) from 2026-06-16 is preserved — not overwritten.
…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx
and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
  - opset17 optimized: 391 total, 49 Transpose, 121 Reshape
  - opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
  - opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
  - opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\\tmp env),
reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added
transpose_analysis_2026_06_16 section with raw op counts.
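The op-count comparison above can be reproduced with a sketch along these lines. `count_ops` uses `onnx.load()` and walks only the top-level graph nodes (nodes inside subgraphs are not counted); `op_count_delta` is a hypothetical helper name. The example feeds in the Transpose and Reshape counts reported above for the optimized dinov2-small graphs instead of loading the actual files.

```python
from collections import Counter


def count_ops(model_path):
    """Count op types among the top-level graph nodes of an ONNX file."""
    import onnx  # imported lazily so the delta helper works without onnx installed

    model = onnx.load(model_path)
    return Counter(node.op_type for node in model.graph.node)


def op_count_delta(before, after):
    """Return {op_type: after - before} for every op type whose count changed."""
    ops = set(before) | set(after)
    return {
        op: after.get(op, 0) - before.get(op, 0)
        for op in sorted(ops)
        if after.get(op, 0) != before.get(op, 0)
    }


# Counts reported above for the optimized dinov2-small graphs.
opset17 = {"Transpose": 49, "Reshape": 121}
opset21 = {"Transpose": 49, "Reshape": 169}
delta = op_count_delta(opset17, opset21)
print(delta)  # {'Reshape': 48}
```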
…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
  h0=10.617ms [10.52, 10.32, 11.01]  h3=9.840ms [10.25, 9.33, 9.94]
  opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
  Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
  Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
  h0=14.326ms [14.15, 14.94, 13.89]  h3=13.830ms [13.70, 13.92, 13.87]
  opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)
  CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small
  (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001
  is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have
identical Transpose node counts (49). The speedup mechanism is NOT Transpose
elimination. The effect is specific to DINOv2 family at a level below op-count
visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.
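The gain and range-overlap verdicts above can be checked from the per-session numbers with a short sketch like this one. The function name `compare` is hypothetical, and it assumes the headline h0/h3 values are the means of the three session figures, which matches the numbers quoted.

```python
from statistics import mean


def compare(h0, h3):
    """Compare per-session latencies (ms) of a baseline (h0) and a candidate (h3)."""
    base, cand = mean(h0), mean(h3)
    gain_pct = (base - cand) / base * 100.0
    # The ranges overlap unless one config's slowest session is faster
    # than the other config's fastest session.
    overlap = max(h3) >= min(h0) and max(h0) >= min(h3)
    return round(gain_pct, 1), overlap


bge = compare([10.52, 10.32, 11.01], [10.25, 9.33, 9.94])
vit = compare([14.15, 14.94, 13.89], [13.70, 13.92, 13.87])
print(bge)  # (7.3, False): ranges do not overlap, but only by 0.07 ms
print(vit)  # (3.5, True): ranges overlap
```

With only three sessions per configuration, a non-overlap margin as thin as the bge-small one is weak evidence, which is why more sessions are requested above.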

Also updated: models_tested list (+5 entries), validated_models sections,
scope and confidence statements, task completion notes in follow_up_required.
Adds the producer/reviewer skill for shipping new HuggingFace model recipes
into the WinML-ModelKit catalog. Two-agent flow (Producer SKILL.md + Reviewer
REVIEW.md) to defend against single-agent self-grading (_meta-005, _meta-006,
_meta-007).

Layout:
- SKILL.md     producer guide, Steps 0-7 (Step 7 = ship the PR)
- REVIEW.md    reviewer's fail-closed checklist
- model_knowledge/<family>.json   per-arch validated findings (bart, marian,
                                  vision-encoder-decoder, m2m_100, pix2struct,
                                  depth_pro, mgp_str, vilt)
- skill_meta/findings.json   33 methodology findings (_meta-001 .. _meta-033)
                             covering Optimum-coverage probe, composite gate,
                             external-data layout, L3 verdict triage,
                             methodology-evolution contract, PR-shipment lanes
- iter5_summary.md, iter6_summary.md   batch retrospectives
- iter6_reports/   PR-description mirrors for bart-large-mnli + vit-gpt2

Outcome contract (per _meta-031, _meta-032, _meta-033):
- Every contribution declares (Effort, Goal, Outcome) tier
- Every Outcome ships a structured 9-item contribution report = PR description
- Every Outcome ships a real github.com PR via Step 7 Lane B (branch off main)
- Skill-level updates push to this working branch (Lane A); model PRs branch
  off main with scope-per-Effort-tier discipline

Iter-6 first L3 PASS: bart-large-mnli accuracy=0.88 on glue/mnli/100 CPU.

This branch is the methodology workbench; per-model recipes ship via their own
narrow PRs off main (shzhen/add-bart-large-mnli-recipe,
shzhen/add-vit-gpt2-image-captioning-recipe).
ssss141414 requested a review from a team as a code owner June 23, 2026 03:09
ssss141414 closed this Jun 23, 2026
ssss141414 added a commit that referenced this pull request Jun 23, 2026
…y, no PR-to-main)

Iter-6 closed PR #935 (Lane A skill snapshot → main) confirmed the over-application of
`gh pr create`. Lane A is push-to-working-branch ONLY; opening a PR against `main` for
skill-only changes is now an explicit anti-pattern in SKILL.md Step 7 and `_meta-033`.

Files: SKILL.md (Step 7 Lane A bullets) + skill_meta/findings.json (`_meta-033`: new gotcha,
mechanism_confirmed=partial, resolution updated).
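The lane rule described here can be expressed as a small routing sketch. Everything in it is hypothetical: the function name, the returned fields, and the example paths are illustrations, and sending mixed skill-plus-model changes to Lane B is an assumption rather than something the commit message states.

```python
from pathlib import PurePosixPath


def choose_lane(changed_paths, skill_root):
    """Pick the shipment lane for a set of changed repository paths."""
    if not changed_paths:
        raise ValueError("no changed paths to route")
    root = PurePosixPath(skill_root)
    # A path is skill-level when the skill root is one of its parent directories.
    skill_only = all(root in PurePosixPath(p).parents for p in changed_paths)
    if skill_only:
        # Lane A: push to the working branch only, never a PR against main.
        return {"lane": "A", "open_pr_to_main": False}
    # Lane B: branch off main and open a narrow PR (assumed for mixed changes too).
    return {"lane": "B", "open_pr_to_main": True}


# Hypothetical paths, for illustration only.
print(choose_lane(["skills/demo/SKILL.md"], "skills/demo"))
print(choose_lane(["recipes/example-model.json"], "skills/demo"))
```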
import json


results = json.load(open(r"ablation-search\results.json"))

if complete_models:
    print(f" [reuse] existing build in {hyp_dir.name}", flush=True)
    ok = True
    build_out = "(reused)"

    p50 = lat.get("p50") if isinstance(lat, dict) else None
    if p50:
        p50s.append(round(p50, 3))
except Exception: