Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -298,6 +298,8 @@ lint.per-file-ignores."tests/**" = [ "ANN", "D", "PLR2004", "PT", "S101", "T20"
lint.per-file-ignores."tests/**/generate_patterns.py" = [ "PERF401" ]
# Generated opset code: Allow long lines
lint.per-file-ignores."src/winml/modelkit/analyze/onnx_opset/**" = [ "D", "E501", "N802", "N803", "N806", "TC001", "TC002", "TC003" ]
# Research scripts: POC code, not production — exempt from all style/type/security rules
lint.per-file-ignores."research/**" = [ "ANN", "D", "E", "N", "S", "T20", "UP", "W", "B", "C4", "FA", "I", "PERF", "PIE", "PT", "PTH", "RET", "RSE", "RUF", "SIM", "TCH", "TID", "TRY", "G", "ICN", "E402", "E501", "F401", "F403", "F811" ]
# === Import Conventions ===
lint.flake8-bandit.check-typed-exception = true
lint.flake8-bandit.hardcoded-tmp-directory = [ "/tmp", "/var/tmp", "C:\\Temp" ]
Expand Down
135 changes: 135 additions & 0 deletions research/adding-model-support/REVIEW.md

Large diffs are not rendered by default.

488 changes: 488 additions & 0 deletions research/adding-model-support/SKILL.md

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions research/adding-model-support/iter5_summary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Iter-5 — Two-agent workflow trial on 10 candidates (producer side, post-reviewer)

**Date**: 2026-06-22 PM, immediately after the producer/reviewer split (`_meta-007`).

**Producer**: this session.
**Reviewer**: invoked via Explore subagent after the initial 3-of-10 run; issued REQUEST_CHANGES with 5 actionable items. Producer then completed the items in option-B mode (see "Reviewer-driven completions" below). First real exercise of REVIEW.md — also surfaced `_meta-011` (reviewer-tool-budget gap) and `_meta-013` (`winml analyze` parquet rules missing on external hosts).

## What the user asked

> "好,你用新的架构,再试试之前的10个model"

The "new architecture" = SKILL.md Steps 0-6 + REVIEW.md (the producer/reviewer separation introduced this morning). The "10 models" = the 12 candidates from iter-1's audit minus the 2 already-shipped (depth_pro done, BLIP is out-of-scope this turn).

## Pre-build per-model tier table (the `_meta-010` requirement, captured retroactively)

This is the table the reviewer demanded under `_meta-010`. Every row is now classified as exactly one of {RUN, BLOCKED-UPSTREAM, OUT-OF-SCOPE-FOR-TURN}. Iter-5's original sin was running 3 then writing the summary; iter-5-after-review reframes:

| # | model_type | example HF id | iter-4 probe verdict | Classification | Outcome |
|---|---|---|---|---|---|
| 1 | bart | facebook/bart-large-mnli | VENDOR-ONLY | RUN | VALIDATED-NEGATIVE; recipe checked in with `_status: BROKEN` marker per `_meta-013` (bart-003) |
| 2 | marian | Helsinki-NLP/opus-mt-en-ru | VENDOR-ONLY (w/ winml override) | RUN | VALIDATED (encoder + decoder structurally validated; marian-003) |
| 3 | marian | Helsinki-NLP/opus-mt-fr-en | sibling of #2 | RUN (added after reviewer item #5) | VALIDATED (marian-004); confirms marian-003 template is reusable across opus-mt checkpoints |
| 4 | m2m_100 | facebook/nllb-200-distilled-600M | VENDOR-ONLY | OUT-OF-SCOPE-FOR-TURN | 600M model, build wall-time too large for the turn budget; recipe pattern is identical to marian-003 (vendor-covered text2text-generation); ship as a follow-up |
| 5 | mgp_str | alibaba-damo/mgp-str-base | UNREGISTERED for `mgp_str`; vendor has `mgp-str` feature-extraction only | OUT-OF-SCOPE-FOR-TURN | Requires writing an OnnxConfig subclass with 3-head outputs (L1-light); violates the "no new code" budget of this turn |
| 6 | pix2struct | google/pix2struct-ai2d-base | VENDOR-ONLY | RUN | VALIDATED-NEGATIVE at config stage (pix2struct-003); workaround (a) `--shape-config` attempted per reviewer item #4 and confirmed dead (pix2struct-004); workaround (b) hand-written recipe deferred to a separate turn (would land as pix2struct-005) |
| 7 | pix2struct | google/deplot | as #6 | BLOCKED-UPSTREAM | Same `winml config` refusal as #6 — gated on the same upstream gap (winml CLI AutoProcessor threading); no per-checkpoint exception expected |
| 8 | pix2struct | google/pix2struct-docvqa-base | as #6 | BLOCKED-UPSTREAM | Same as #7 |
| 9 | vilt | dandelin/vilt-b32-finetuned-vqa | UNREGISTERED | OUT-OF-SCOPE-FOR-TURN | True L1 — write OnnxConfig from scratch against transformers source. Also "first VQA contribution" per `_meta-003`; needs a `TASK_REGISTRY` decision (classification-VQA vs generative-VQA) before any code lands |
| 10 | vision-encoder-decoder | breezedeus/pix2text-mfr | VENDOR + WINML adds | RUN (added after reviewer item #5) | VALIDATED-NEGATIVE at fetch stage (vision-encoder-decoder-003); HF repo lacks standard weight files; recipe checked in with `_status: BROKEN` marker per `_meta-013`. Architecture-level L0★ path remains open for any standard VED checkpoint (e.g. nlpconnect/vit-gpt2-image-captioning); blocked only for THIS specific checkpoint |

**Score (post-reviewer)**: 5 RUN (3 VALIDATED + 2 VALIDATED-NEGATIVE with recipes + markers), 2 BLOCKED-UPSTREAM (pix2struct siblings — gated on same upstream gap), 3 OUT-OF-SCOPE-FOR-TURN (m2m_100 / mgp_str / vilt — explicitly deferred with reasons). Compared to the first iter-5 pass (3 of 10 with no classification of the unrun rows), this table now satisfies `_meta-010`.

## Reviewer-driven completions (option-B execution log)

The Explore subagent acting as reviewer issued REQUEST_CHANGES with 5 items. The producer executed all 5:

1. **Reviewer item #1 — "recipe schema violation in marian-003"**. ROOT CAUSE: SKILL.md was wrong, not the recipes. Real `WinMLBuildConfig` schema is `{loader, export, optim, quant, compile, eval}` (verified at `src/winml/modelkit/config/build.py:96-200`); SKILL.md previously claimed `{export, optim, quant, loader, eval}`, missing `compile` and incorrectly implying `eval` is required. SKILL.md Step 3 + REVIEW.md Outcome-L0 row rewritten. Captured as `_meta-012`.
2. **Reviewer item #2 — "re-run analyze with available EP"**. ROOT CAUSE: BLOCKED on host. `src/winml/modelkit/analyze/rules/runtime_check_rules/` contains only `README.md` on external hosts; `scripts/download_rules.py` is Microsoft-internal-only. `winml analyze --ep cpu ...` fails with "No runtime rule parquet files were found". REVIEW.md updated to add the parquet-availability caveat. Captured as `_meta-013`.
3. **Reviewer item #3 — "bart recipe should be marked broken"**. RESOLUTION: top-level `_status: "BROKEN — ..."` field added to `examples/recipes/facebook_bart-large-mnli/text-classification_fp16_config.json`. `WinMLBuildConfig.from_dict` uses `.get()` for known keys and silently ignores unknown ones, so the marker is safe. Convention also applied to `examples/recipes/breezedeus_pix2text-mfr/`. REVIEW.md gained a "known-broken recipe convention" check.
4. **Reviewer item #4 — "pix2struct must attempt at least one workaround"**. RESOLUTION: workaround (a) `--shape-config` attempted; confirmed dead — flag only accepts text/vision/audio dims, NO `max_patches`/`patch_dim` key. Captured as pix2struct-004 with full mechanism; the hand-written-recipe path documented as pix2struct-005 deferred.
5. **Reviewer item #5 — "expand the 3-of-10 sample"**. RESOLUTION: ran fr-en (marian-004 VALIDATED) and pix2text-mfr (vision-encoder-decoder-003 VALIDATED-NEGATIVE at fetch stage). The 7 unrun rows are now classified per the table above with explicit reasons.

## What the validated runs taught

### Marian (POSITIVE — 2 checkpoints validated)

- Producer prediction (iter-4): L0★, L0 reachable.
- Reality: L0★ confirmed for opus-mt-en-ru (marian-003) AND opus-mt-fr-en (marian-004). Recipe pair (`translation_fp16_encoder_config.json` + `translation_fp16_decoder_config.json` from `winml config --task translation`) generalises across opus-mt checkpoints with no manual edits — `winml config` auto-fills the per-checkpoint vocab.
- Artifacts mined per SKILL.md Step 4 (en-ru): encoder 204 nodes / 51.2M params / autoconf optim = {clamp_constant_values, gelu_fusion, matmul_add_fusion, remove_isnan_in_attention_mask}; decoder 392 nodes / 76.7M params / same optim. ScatterND on decoder KV-cache writes is the dominant "unknown" op in per-EP coverage — file as a per-EP rule gap.
- fr-en numbers: encoder 34.0s / 199 MB (external-data layout kicks in above en-ru size); decoder 42.3s / 346 MB.

### BART (NEGATIVE — first methodology counter-example to "probe ⇒ build succeeds")

- Producer prediction (iter-4): L0★, L0 reachable. (Optimum `BartOnnxConfig` covers text-classification.)
- Reality: build FAILS at export with `index -1 is out of bounds for dimension 1 with size 0`. Likely cause: `BartForSequenceClassification` pools the encoder hidden state at the last `eos_token_id` position; random int32 dummy input never contains an eos token, so `nonzero()` returns empty and `[-1]` indexing throws.
- **This falsifies the iter-4 methodology assumption that "Optimum coverage ⇒ build will succeed"**. Optimum coverage = the OnnxConfig exists. It does NOT = the DummyInputGenerator paired with that OnnxConfig produces inputs that survive checkpoint-specific assertions. Captured as bart-003 + `_meta-008`.

### Pix2Struct (NEGATIVE — second methodology counter-example, autoconfig dead-ends)

- Producer prediction (iter-4): L0★ in principle.
- Reality: `winml config` REFUSES to emit any draft — "Preprocessors for pix2struct need to be available for the ONNX export to infer input static shapes. Got: None". The autoconfig pathway is hard-stopped UPSTREAM of `winml build`, at Optimum's normalized-config layer.
- Workaround (a) `--shape-config` confirmed dead (pix2struct-004); workaround (b) hand-written recipe deferred.
- **New failure class**: "autoconfig path dead-ends before producing a recipe". Captured as pix2struct-003 + `_meta-009`.

### Vision-Encoder-Decoder / pix2text-mfr (NEGATIVE — third methodology counter-example, repo-format gate)

- Producer prediction (iter-4): L0★, L0 reachable. (Optimum + winml both register vision-encoder-decoder; composite-emit produces 2 drafts.)
- Reality: `winml config` SUCCEEDED and emitted both drafts. `winml build` FAILED at fetch with `breezedeus/pix2text-mfr does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.` HF repo stores weights in a non-standard layout.
- **A FOURTH gate** (after probe + winml registration + `winml config` cooperation): HF repo file-layout check. No diagnostic step currently covers this. Captured as vision-encoder-decoder-003 + suggestion to add `huggingface_hub.list_repo_files(...)` pre-flight in `winml config` or a new `winml doctor`. Architecture-level L0★ path remains open for any standard VED checkpoint.

## Methodology lessons captured this turn (now totalling 13 `_meta-*` findings)

1. **Optimum coverage is necessary but not sufficient** — `_meta-008` (bart-003 + vision-encoder-decoder-003 are both counter-examples). The reviewer agent now enforces the build-attempt requirement.
2. **`winml config` can dead-end before producing a draft** — `_meta-009` (pix2struct family + likely fuyu / donut variants). SKILL.md Step 1 verdict table now explicitly documents this gate.
3. **First-mover seq2seq template gap is now closed for marian and is generalisable** — marian-003 → marian-004 confirms the recipe pair pattern transfers across opus-mt checkpoints.
4. **Producer-only on a batch of 10 is still self-grading** — `_meta-010`. SKILL.md now requires a pre-build per-model tier table for batch contributions and REVIEW.md has a corresponding REQUEST_CHANGES rule.
5. **Reviewer subagent without terminal access cannot fully execute REVIEW.md** — `_meta-011`. REVIEW.md now distinguishes REQUIRED-FROM-EVIDENCE checks from REQUIRED-FROM-RE-EXECUTION checks; recommended reviewer-agent invocation pattern documented as a follow-up.
6. **Documentation drift between SKILL.md and `WinMLBuildConfig`** — `_meta-012`. Recommendation: generate the schema callout from the dataclass at doc-build time to eliminate the drift permanently.
7. **Host-provisioning gap for `winml analyze` parquet rules** — `_meta-013`. The fix lives at the cli-distribution layer, not in the methodology text; documented as a host-environment caveat in REVIEW.md.
8. **Recipe-marker convention for known-broken recipes** — top-level `_status` field is the lightest-weight option (no directory move, no separate README); silently accepted by `WinMLBuildConfig.from_dict` because the dataclass uses `.get()`. Documented in REVIEW.md and applied to bart-large-mnli + pix2text-mfr.

## Hand-off package for the next reviewer pass

If a separate reviewer agent (ideally with terminal access this time, per `_meta-011`) picks up this post-option-B state:

- **Recipes to re-verify by re-running `winml build`**:
- `examples/recipes/Helsinki-NLP_opus-mt-en-ru/translation_fp16_{encoder,decoder}_config.json` — should reproduce `✅ Build complete` in ~78s combined.
- `examples/recipes/Helsinki-NLP_opus-mt-fr-en/translation_fp16_{encoder,decoder}_config.json` — should reproduce `✅ Build complete` in ~76s combined (external-data layout; 199 MB encoder + 346 MB decoder).
- `examples/recipes/facebook_bart-large-mnli/text-classification_fp16_config.json` — should reproduce the `index -1` error; `_status` field should NOT affect the build (verify dataclass ignores it).
- `examples/recipes/breezedeus_pix2text-mfr/image-to-text_fp16_{encoder,decoder}_config.json` — should reproduce the "does not appear to have a file named pytorch_model.bin..." error.
- **Findings to audit**: marian-003/004, bart-003, pix2struct-003/004, vision-encoder-decoder-003, all of `_meta-008` through `_meta-013`.
- **Independent verifications the reviewer SHOULD do**:
1. Re-run the Optimum probe and confirm marian/bart/pix2struct/VED verdicts.
2. Re-read all three artifacts (analyze_result.json, export_htp_metadata.json, winml_build_config.json) for the marian fr-en build and confirm marian-004's numbers.
3. Verify that adding a `_status` key to a recipe does NOT change `winml build` behaviour (positive control on marian fr-en after adding a no-op `_status` field).
4. Optionally: attempt pix2struct workaround (b) — hand-written recipe with `flattened_patches[1, 4096, 770]` + `attention_mask[1, 4096]` — to land pix2struct-005.
- **What the post-option-B reviewer CANNOT verify**: `winml analyze` per-EP re-runs (parquet rules missing per `_meta-013`); cite this as a host limitation, not a producer failure.
- **What the producer still owes (out of this turn's scope)**: m2m_100 build (deferred for size), mgp_str OnnxConfig subclass (deferred for L1-light code), vilt full OnnxConfig + TASK_REGISTRY decision (deferred for L1 + new task family), pix2struct workaround (b) attempt (deferred for the hand-written-recipe + processor-threading question).
Loading
Loading