feat(qwen3): genai bundle generation#996

Merged
DingmaomaoBJTU merged 30 commits into main from pr/836/feature/qwen3-quant on Jul 3, 2026

@DingmaomaoBJTU DingmaomaoBJTU commented Jun 29, 2026

Collaborator

Summary

Adds onnxruntime-genai bundle generation for winml-exported Qwen3 transformer-only models. A single scripts/qwen3.py export command builds all four bundle components (transformer ctx/iter, embeddings, lm_head) and assembles them into an onnxruntime-genai directory (genai_config.json + HF tokenizer files). The bundle-assembly machinery is split into an EP-agnostic core and a Qwen3/QNN-specific layer.

Inference over the assembled bundle is handled by the winml-genai runtime (winml perf --runtime winml-genai, already on main); this PR is generation-only. A dedicated genai inference session lands in a follow-up.

What's new

| File | Change |
| --- | --- |
| src/winml/modelkit/utils/genai.py | New generic, EP-agnostic core: PipelineStage, DecoderIOMapping, build_genai_config(), build_decoder_pipeline_stages(), write_genai_bundle() |
| src/winml/modelkit/models/hf/qwen3/genai.py | New Qwen3 + QNN layer: build_qwen3_transformer_only_stages(), write_genai_bundle() wrapper, qnn_stage_session_options() (QNN HTP routing) |
| scripts/qwen3.py | New unified export CLI (export subcommand) |
| scripts/export_qwen3_transformer_only.py, scripts/export_qwen3_embeddings_lm_head.py | Deleted; superseded by scripts/qwen3.py |
| src/winml/modelkit/onnx/utils.py | Adds strip_node_attrs() (drops exporter-injected default GQA attributes) |
| src/winml/modelkit/models/hf/qwen3/qwen3_modeling.py | max_rope_len export hook (prepared; defaults to current behavior) |
| tests/unit/models/qwen3/test_genai_config.py, test_qwen3_modeling.py, tests/unit/onnx/test_utils.py | New unit tests |

Design

utils/genai.py is architecture-agnostic. build_genai_config takes a list[PipelineStage] and a DecoderIOMapping — no tensor names or EP details are hardcoded. build_decoder_pipeline_stages introspects the built ctx.onnx/iter.onnx (_introspect_onnx_io + _detect_format_patterns) to discover past_keys_%d / present_values_%d-style patterns from the actual graph I/O, so tensor names can never drift from what the ONNX really contains.

models/hf/qwen3/genai.py is the Qwen3/QNN-specific layer. build_qwen3_transformer_only_stages wires the four Qwen3 stages, and qnn_stage_session_options emits the QNN HTP session_options for the transformer stages — all QNN specifics live here, not in the generic core. Its write_genai_bundle wraps the generic assembler so the one-shot API is unchanged.

build_genai_config(hf_config, ..., pipeline=stages, decoder_io=decoder_io)   <- generic (utils/genai.py)
build_qwen3_transformer_only_stages(ctx_onnx, iter_onnx, num_layers)          <- Qwen3/QNN (models/hf/qwen3/genai.py)
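
As a rough illustration of the split described above, the sketch below shows how a generic config builder could consume caller-supplied stages and a KV-name mapping without hardcoding any tensor names. It is not the code in utils/genai.py: the field names on DecoderIOMapping, the helper build_config_sketch, the tensor names, and the dictionary layout are all assumptions made for the example and do not reproduce the full onnxruntime-genai schema.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class PipelineStage:
    """One ONNX file in the pipeline; the caller supplies every tensor name."""

    name: str
    filename: str
    inputs: list[str]
    outputs: list[str]
    run_on_prompt: bool = True
    run_on_token_gen: bool = True

    def to_dict(self) -> dict:
        # Each pipeline entry is a single-key mapping: {stage_name: {...}}.
        return {
            self.name: {
                "filename": self.filename,
                "inputs": self.inputs,
                "outputs": self.outputs,
                "run_on_prompt": self.run_on_prompt,
                "run_on_token_gen": self.run_on_token_gen,
            }
        }


@dataclass
class DecoderIOMapping:
    """%d-style templates for per-layer KV names (field names are hypothetical)."""

    past_key_names: str = "past_keys_%d"
    past_value_names: str = "past_values_%d"
    present_key_names: str = "present_keys_%d"
    present_value_names: str = "present_values_%d"


def build_config_sketch(stages: list[PipelineStage], decoder_io: DecoderIOMapping,
                        num_layers: int) -> dict:
    """Assemble a config dict purely from the arguments (illustrative layout)."""
    return {
        "model": {
            "decoder": {
                "num_hidden_layers": num_layers,
                "inputs": {
                    "past_key_names": decoder_io.past_key_names,
                    "past_value_names": decoder_io.past_value_names,
                },
                "outputs": {
                    "present_key_names": decoder_io.present_key_names,
                    "present_value_names": decoder_io.present_value_names,
                },
                "pipeline": [stage.to_dict() for stage in stages],
            }
        }
    }


# Hypothetical stage and tensor names, chosen only for the demonstration.
stages = [
    PipelineStage("context", "ctx.onnx", ["hidden_in"], ["hidden_out"],
                  run_on_token_gen=False),
    PipelineStage("iterator", "iter.onnx", ["hidden_in"], ["hidden_out"],
                  run_on_prompt=False),
]
config = build_config_sketch(stages, DecoderIOMapping(), num_layers=2)
print(json.dumps(config, indent=2))
```

Because the builder only reads its arguments, a different model family would swap in its own stages and name templates without touching the builder.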

Usage

# Build all four components + assemble the genai bundle (transformer on NPU/QNN):
uv run python scripts/qwen3.py export --model-id Qwen/Qwen3-0.6B --device npu --output out/bundle

# Reuse pre-built companion ONNX files instead of rebuilding them:
uv run python scripts/qwen3.py export --model-id Qwen/Qwen3-0.6B \
  --embeddings path/to/embeddings.onnx --lm-head path/to/lm_head.onnx --output out/bundle

# Benchmark / run the assembled bundle (runtime already on main):
uv run winml perf -m out/bundle --runtime winml-genai

Notes

  • embeddings.onnx (fp32) and lm_head.onnx (w4a32 MatMulNBits) are built by the script on CPU; --embeddings / --lm-head override them with pre-built files.
  • Transformer ctx/iter stages route to the QNN HTP (NPU) via per-stage session_options in genai_config.json; embeddings/lm_head stay on CPU.
  • Verified end-to-end: winml perf -m out/bundle --runtime winml-genai installs the WinML QNN EP, runs the transformer stages on the NPU, and generates correctly.

Follows from #836.

Comment thread src/winml/modelkit/models/hf/qwen3/genai.py Fixed
Comment thread src/winml/modelkit/session/genai_session.py Fixed
Comment thread src/winml/modelkit/session/genai_session.py Fixed
github-actions Bot added 15 commits July 1, 2026 10:41
- src/winml/modelkit/models/hf/qwen3/genai.py: new module with
  build_genai_config() and write_genai_bundle(). build_genai_config
  generates the onnxruntime-genai pipeline config JSON from a HF
  PretrainedConfig + max_cache_len + prefill_seq_len. write_genai_bundle
  copies the winml-built ctx/iter ONNX, optional placeholder embeddings
  and lm_head ONNX, saves tokenizer files from HF, and writes
  genai_config.json.

- scripts/export_qwen3_transformer_only.py: add --genai-bundle DIR,
  --embeddings ONNX, --lm-head ONNX flags. When --genai-bundle is set,
  write_genai_bundle is called after the build to emit a complete
  onnxruntime-genai bundle.

- scripts/infer_genai.py: new inference script. Loads the genai bundle
  with og.Config, registers WinML EPs (QNN), and runs greedy generation
  via og.Generator. Supports --ep cpu|qnn, --chat template wrapping,
  --max-new, --context-length, --verbose.

- src/winml/modelkit/models/hf/qwen3/__init__.py: export
  build_genai_config and write_genai_bundle.

- tests/unit/models/qwen3/test_genai_config.py: 21 unit tests for
  build_genai_config covering pipeline structure, KV name counts,
  tensor name constants, edge cases (list eos_token_id, missing head_dim,
  None pad_token_id, custom filenames, variable layer count).
…tion

Replace hardcoded tensor-name constants with a data-driven design:

- PipelineStage dataclass: carries name, filename, run_on_prompt/token_gen,
  inputs, outputs, is_lm_head. Callers construct stages explicitly; no
  tensor names are baked into build_genai_config itself.

- DecoderIOMapping dataclass: holds the %d-style format strings that genai
  uses to expand per-layer KV tensor names. Defaults match Qwen3 naming
  but any naming convention is supported.

- build_genai_config: now takes pipeline: list[PipelineStage] and
  decoder_io: DecoderIOMapping. Architecture-agnostic; no Qwen3-specific
  logic. prefill_seq_len=None omits the sliding_window section.

- _introspect_onnx_io: reads graph.input / graph.output from an ONNX
  model without loading external data weights.

- _detect_format_patterns: scans tensor names for indexed groups matching
  <prefix><int> with exactly num_layers consecutive zero-based indices,
  returns {prefix: 'prefix%d'} patterns.

- build_qwen3_transformer_only_stages: Qwen3-specific factory that calls
  _introspect_onnx_io on the built ctx/iter ONNX, detects KV patterns via
  _detect_format_patterns, and returns (list[PipelineStage], DecoderIOMapping).
  Tensor names can never drift from the actual ONNX graph I/O.

- write_genai_bundle: delegates to build_qwen3_transformer_only_stages
  instead of hardcoding names.
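
The pattern detection described in this commit can be sketched as follows. This is a minimal reading of the description (group names by prefix, keep groups whose indices are exactly 0..num_layers-1), not the actual _detect_format_patterns implementation; the function name and the sample tensor names are assumptions.

```python
import re
from collections import defaultdict

# A name qualifies when it ends in an integer: <prefix><int>.
_INDEXED_NAME = re.compile(r"^(?P<prefix>.*?)(?P<index>\d+)$")


def detect_format_patterns(names, num_layers):
    """Return {prefix: 'prefix%d'} for groups indexed exactly 0..num_layers-1."""
    indices_by_prefix = defaultdict(set)
    for name in names:
        match = _INDEXED_NAME.match(name)
        if match:
            indices_by_prefix[match.group("prefix")].add(int(match.group("index")))

    expected = set(range(num_layers))
    return {
        prefix: prefix + "%d"
        for prefix, indices in indices_by_prefix.items()
        if indices == expected
    }


# Hypothetical graph inputs for a two-layer model.
graph_inputs = [
    "input_ids",
    "past_keys_0", "past_keys_1",
    "past_values_0", "past_values_1",
    "extra_0",  # only one index, so it is not a per-layer group
]
patterns = detect_format_patterns(graph_inputs, num_layers=2)
```

Requiring the full index set, rather than any trailing digit, keeps unrelated numbered tensors such as "extra_0" out of the KV mapping.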

Tests (35 total, all pass):
- TestBuildGenaiConfig: +2 new cases (no sliding_window, custom DecoderIOMapping)
- TestDetectFormatPatterns: 6 new unit tests for the pattern detector
- TestBuildQwen3TransformerOnlyStages: 6 new tests using patched
  _introspect_onnx_io (no real ONNX files required)

- GenaiSession drives og.Model + og.Generator lifecycle for autoregressive
  text generation; peer class to WinMLSession (not a subclass)
- GenerationConfig dataclass: temperature, top_p, top_k, max_new_tokens,
  repetition_penalty, do_sample
- Lazy onnxruntime_genai import via _import_og() — class importable without
  the package installed (raises GenaiNotInstalledError on first use)
- Reuses WinMLEPRegistry for EP discovery/registration (idempotent)
- EP support: cpu (clear_providers only), qnn, dml
- context_length read from genai_config.json; overridable at construction
- generate_streaming() yields decoded token strings; generator del'd in finally
- generate() returns joined string; auto-load on first call if not loaded
- 33 unit tests; all use patch.dict(sys.modules) to avoid real hardware

- Moves chat template logic from infer_genai.py into GenaiSession
- Supports optional system prompt
- ChatML is not Qwen3-specific; used by Qwen2/3, Yi, Mistral, etc.
- infer_genai.py _wrap_chat_template now delegates to the static method
- Updated --chat flag help text and script docstring
- 4 new tests covering user-only, with-system, no-system-turn, assistant-priming

- PipelineStage gains session_options: dict | None = None field;
  PipelineStage.to_dict() emits it when set
- Add _qnn_stage_session_options(log_id, soc_model) helper that
  produces QNN HTP provider_options for a pipeline stage
- build_qwen3_transformer_only_stages gains ep='cpu' and soc_model='60'
  params; when ep='qnn' the context and iterator stages receive QNN
  session_options, embeddings and lm_head stay on CPU (no session_options)
- write_genai_bundle threads ep/soc_model through
- export_qwen3_transformer_only.py passes ep='qnn' when --device npu
- 5 new tests covering cpu/qnn ep routing and soc_model propagation
  (39 total, all pass)
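
A minimal sketch of the per-stage routing described in this commit, assuming stage names "context", "iterator", "embeddings", and "lm_head". The option keys soc_model and htp_graph_finalization_optimization_mode and the 'qnn' provider key appear elsewhere in this PR; the exact dictionary shape, the log_id key, and both function names are assumptions, not the helper that shipped.

```python
from __future__ import annotations


def qnn_session_options_sketch(log_id: str, soc_model: str = "60") -> dict:
    """Build session_options for one NPU stage (shape is an assumption)."""
    return {
        "log_id": log_id,
        "provider_options": [
            {
                "qnn": {
                    "soc_model": soc_model,
                    "htp_graph_finalization_optimization_mode": "3",
                }
            }
        ],
    }


def stage_session_options(stage_names: list[str], ep: str = "cpu",
                          soc_model: str = "60") -> dict[str, dict | None]:
    """Give QNN options to the transformer stages only; others stay on CPU."""
    npu_stages = {"context", "iterator"}  # hypothetical stage names
    return {
        name: qnn_session_options_sketch(name, soc_model)
        if ep == "qnn" and name in npu_stages
        else None
        for name in stage_names
    }


all_stages = ["embeddings", "context", "iterator", "lm_head"]
qnn_routing = stage_session_options(all_stages, ep="qnn")
cpu_routing = stage_session_options(all_stages, ep="cpu")
```

A stage whose value is None would simply omit session_options from its config entry, which matches the "emit it when set" behavior described for PipelineStage.to_dict().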
Remove clear_providers/append_provider calls from GenaiSession.load().
EP placement is fully driven by per-stage session_options in genai_config.json.
clear_providers() only clears the top-level provider and cannot override
per-stage session_options embedded in the pipeline config.

- Add 'mixed' EP (use genai_config.json as-is; default for infer_genai.py)
- _NEEDS_WINML_EPS covers mixed/qnn/dml to trigger EP registration
- Replace _EP_PROVIDER_MAP with _VALID_EPS + _NEEDS_WINML_EPS sets
- Update tests: remove append_provider assertions, add mixed/config-not-modified tests
- infer_genai.py default EP changed from 'cpu' to 'mixed'

Result: NPU bundle (out/qwen3_bundle_npu) now runs at 9.3 tok/s vs 1.2 tok/s CPU

- GenaiSession gains compile=True parameter
- _prepare_compiled_bundle(): detects QNN stages from genai_config.json,
  compiles each stage to EPContext ONNX via ort.ModelCompiler in a subprocess
- _compile_stage(): 5-minute timeout per stage to handle QNN SDK hang
  (known bug: w8a16 + multi-token prefill hangs indefinitely)
- Compiled artifacts cached in bundle_dir/_compiled/; reused on subsequent runs
- _mirror_non_onnx_files(): symlinks/copies tokenizer files so og.Config
  can load from the compiled sub-directory
- infer_genai.py --compile flag wired through to GenaiSession
…on_optimization_mode=0

Root cause: QNN SDK ModelCompiler deadlocks when compiling w8a16 quantized
ONNX with multi-token static input shapes (seq_len > 1) at graph finalization
optimization levels 1-3. The genai_config uses level 3 for runtime inference,
which triggers the hang when passed to ModelCompiler directly.

Fix: _compile_stage now forces htp_graph_finalization_optimization_mode=0 for
compilation. This lets ModelCompiler finish (ctx ~41s, iter ~67s) while runtime
inference still uses the full level-3 optimization from genai_config (EPContext
loading bypasses compilation entirely, so the runtime option is irrelevant).

Also fixes:
- Pipeline stage detection: genai_config uses 'qnn' key (not 'QNNExecutionProvider')
  in provider_options; detection and option extraction now use the correct key
- _patch_stage_filename: genai_config pipeline is a list, not a dict; updated
  to iterate list entries correctly
- _prepare_compiled_bundle: passes QNN provider options from each stage's
  session_options to _compile_stage so soc_model, backend_path, etc. are respected
- Removed the 'prefill fallback to JIT' warning since the hang is now fixed
… spawn

Windows multiprocessing spawn serialises the subprocess target via pickle.
Local functions (closures) defined inside a method cannot be pickled, which
caused 'AttributeError: Can't pickle local function' at runtime.

Moved the compilation logic to a module-level function _qnn_compile_worker
so it is importable by name in the spawned subprocess.

Also fix ONNX filename in compiled genai_config: use ctx_onnx.name (just the
filename) instead of str(ctx_onnx) (absolute path).  ort-genai resolves
filenames relative to the directory passed to og.Config, so an absolute path
causes double-path concatenation and a 'file not found' error.
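
Both fixes in this commit can be reproduced with the standard library alone. The sketch below uses stand-in worker functions (compile_worker and local_worker are hypothetical, not _qnn_compile_worker) to show that a function defined inside another function cannot be pickled, and that Path.name yields the bare filename a relative lookup needs.

```python
import pickle
from pathlib import Path


def compile_worker(stage_path: str) -> str:
    """Module-level stand-in: a spawned process can import it by name."""
    return f"compiled:{stage_path}"


def make_local_worker():
    def local_worker(stage_path: str) -> str:
        """Defined inside another function, so pickle cannot reference it."""
        return f"compiled:{stage_path}"

    return local_worker


def can_pickle(obj) -> bool:
    """Report whether pickle accepts obj, as the spawn start method requires."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        return False
    return True


local_is_picklable = can_pickle(make_local_worker())

# Path.name keeps only the final component, so a loader that joins it onto
# the bundle directory does not end up with a doubled path.
ctx_onnx = Path("out") / "bundle" / "_compiled" / "ctx.onnx"
config_filename = ctx_onnx.name
```

Functions are pickled by qualified name rather than by value, which is why moving the worker to module scope is enough to make it usable as a spawn target.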
… stages

Previously _compile_stage forced mode='0' for ALL stages to avoid a QNN SDK
deadlock on w8a16 + multi-token prefill. This also silently capped the iter
(generation) stage at mode 0, producing under-optimized kernels (~10 tok/s).

Fix: only force mode=0 for prefill stages (run_on_prompt=true, seq_len>1,
where the deadlock occurs). Generation stages (run_on_token_gen=true,
seq_len=1) use the configured mode from genai_config.json (typically '3'),
which is safe for single-token input and produces fully-optimized kernels.

Performance:
  Before: 10.4 tok/s (both ctx+iter compiled with mode 0)
  After:  43.4 tok/s (ctx mode 0, iter mode 3) — matches reference ~45 tok/s

_prepare_compiled_bundle now passes is_prefill flag per stage based on
run_on_prompt / run_on_token_gen fields in genai_config.json pipeline config.
… _compile_stage

The original mode=0 override was added to avoid a QNN SDK deadlock when
compiling w8a16 prefill (seq_len>1) at higher optimization levels.

Testing revealed the deadlock only occurs when QNN provider options are
NOT passed to ort.ModelCompiler at all (causing it to fall back to a
broken default path). With correct QNN options (backend_path, soc_model,
etc.) forwarded, mode=3 compiles successfully for both ctx (~73s) and
iter (~67s) with no hang.

Remove the is_prefill flag and mode override entirely. _compile_stage now
passes genai_config QNN options unchanged, giving fully-optimized kernels
for all stages.

Performance (hot NPU, EPContext loaded):
  ctx+iter both mode=3: ~44.5 tok/s vs reference ~45 tok/s
…i as shim

- Extract all architecture-agnostic logic (PipelineStage, DecoderIOMapping,
  build_genai_config, build_decoder_pipeline_stages, write_genai_bundle,
  qnn_stage_session_options, ONNX introspection helpers) into
  src/winml/modelkit/utils/genai.py so other model families can reuse it
- Reduce qwen3/genai.py to a thin re-export shim with a backward-compatible
  build_qwen3_transformer_only_stages alias for existing callers
- fix(codeql): remove unused _TOKENIZER_FILES from utils/genai.py
- fix(codeql): remove unnecessary del generator in GenaiSession.generate_streaming
- fix(codeql): add missing Protocol body ellipsis in QuantConfigFinalizer.finalize
- fix(codeql): import get_quant_finalizer directly in quant/__init__.py
- fix(test): update mock patch path to winml.modelkit.utils.genai._introspect_onnx_io
- fix(test): replace bare 'import onnx' with 'from onnx import ...' in
  test_qwen3_calibration.py
- fix(_mirror_non_onnx_files): skip .onnx/.data files to avoid duplicating
  multi-GB model weights into _compiled/ on first --compile run
- fix(generate_streaming): restore try/finally around og.Generator so the
  KV cache buffer is freed immediately on early caller exit (GeneratorExit),
  not deferred until GC
- fix(build_genai_config): preserve eos_token_id list unchanged — ORT genai
  accepts a JSON array; truncating to [0] silently discards secondary stop
  tokens (e.g. Qwen3's [151645, 151643])
- fix(build_decoder_pipeline_stages): use name-based KV pattern matching
  ('key'/'val' in prefix) instead of purely positional, so models that list
  past_values before past_keys in their ONNX graph don't get a silent swap
- fix(qwen3/genai __all__): remove private _detect_format_patterns from
  __all__; tests now import it directly from winml.modelkit.utils.genai
- test: update test_eos_token_id_list_preserved to expect full list
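
The name-based KV matching fix in the list above can be illustrated with a small sketch. It assumes the detected patterns arrive as a {prefix: pattern} dict and that prefixes contain "key" or "val"; the function assign_kv_roles is hypothetical and stands in for the logic inside build_decoder_pipeline_stages.

```python
from __future__ import annotations


def assign_kv_roles(patterns: dict[str, str]) -> dict[str, str]:
    """Map detected patterns to 'key'/'value' by name, not by position."""
    roles: dict[str, str] = {}
    for prefix, pattern in patterns.items():
        lowered = prefix.lower()
        if "key" in lowered:
            role = "key"
        elif "val" in lowered:
            role = "value"
        else:
            continue  # not a KV tensor group
        if role in roles:
            raise ValueError(f"more than one {role} pattern: {roles[role]!r}, {pattern!r}")
        roles[role] = pattern

    missing = {"key", "value"} - roles.keys()
    if missing:
        raise ValueError(f"no pattern found for: {sorted(missing)}")
    return roles


# Values listed before keys, as some exported graphs might order them.
values_first = {"past_values_": "past_values_%d", "past_keys_": "past_keys_%d"}
roles = assign_kv_roles(values_first)
```

A purely positional assignment would have treated the first entry as the key pattern here; matching on the name makes graph I/O order irrelevant and fails loudly when a role is ambiguous or absent.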
… in perf print

- src/: convert all absolute winml.modelkit.* imports to relative
  - qwen3/genai.py: from ....utils.genai import
  - utils/genai.py: from ..onnx import copy_onnx_model
  - session/genai_session.py: from .ep_registry / from ..winml (subprocess worker)
- genai_session: patch failed-stage filename to absolute src path so
  ort-genai can resolve it when loading from compiled_dir (was crashing)
- infer_genai.py: guard n/dt with max(dt,1e-9) to avoid ZeroDivisionError
- tests: import GenaiSession symbols from package __init__ not submodule
… model types

Now that qwen3_embeddings_only and qwen3_lm_head_only are available (merged
from main via PR #1008), remove the placeholder pattern from the genai bundle
assembly:

- export_qwen3_transformer_only.py: when --genai-bundle is set, automatically
  build embeddings (fp32) and lm_head (w4a32/MatMulNBits) via WinMLAutoModel
  if --embeddings / --lm-head override paths are not provided
- --embeddings / --lm-head flags are kept as optional override paths for callers
  that want to supply a pre-built ONNX instead of building from model_id
- Both companion models are built on CPU (task=feature-extraction, no_compile)
  since they run on CPU in the genai pipeline
- Drop the now-stale WARNING messages about missing embeddings/lm_head
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the pr/836/feature/qwen3-quant branch from 8364969 to f729f92 Compare July 1, 2026 02:45
Comment thread src/winml/modelkit/models/hf/qwen3/genai.py Fixed
Comment thread src/winml/modelkit/quant/calibration/base.py Fixed
Comment thread src/winml/modelkit/utils/genai.py
…verride; fix shape_config

- utils/genai.py: add _patch_seq_dim_dynamic helper; apply it to both
  embeddings and lm_head ONNX after copy in write_genai_bundle — ort-genai
  calls these models with prompt_len tokens on prefill and seq_len=1 on each
  decode step, so the seq_len dimension must be symbolic not fixed
- session/genai_session.py: revert _prepare_cpu_bundle and the ep==cpu hook
  (GenaiSession uses genai_config.json as-is; cpu override not supported)
- export_qwen3_transformer_only.py: remove shape_config from companion
  build call — embeddings/lm_head have dynamic seq_len, no static shape needed
Comment thread src/winml/modelkit/utils/genai.py
github-actions Bot added 2 commits July 1, 2026 12:57
_mirror_non_onnx_files previously skipped ALL .onnx/.data files, which
meant embeddings.onnx and lm_head.onnx were inaccessible when ort-genai
loaded from _compiled/.  Now only the QNN-compiled stage files (those
in qnn_stages) are excluded; CPU-side ONNX files are symlinked into the
compiled bundle directory so ort-genai can find them.

Also pass compiled_onnx_names from _prepare_compiled_bundle to the
mirror helper so the skip set is driven by what was actually compiled.

Verified: --compile produces valid EPContext for ctx/iter stages,
embeddings.onnx and lm_head.onnx are symlinked, inference runs at
~37 tok/s on Snapdragon X Elite NPU.
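
The mirroring behavior described in this commit can be sketched with the standard library. This is an assumption-laden illustration, not _mirror_non_onnx_files itself: the function name, the copy fallback when symlinks are unavailable, and the return value are choices made for the example.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path


def mirror_bundle_files(src_dir: Path, dst_dir: Path, skip_names: set[str]) -> list[str]:
    """Link (or copy) every file from src_dir into dst_dir except skip_names."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    mirrored = []
    for src in sorted(src_dir.iterdir()):
        # Sub-directories (including dst_dir itself) and compiled stages are skipped.
        if not src.is_file() or src.name in skip_names:
            continue
        dst = dst_dir / src.name
        if dst.exists() or dst.is_symlink():
            continue  # already mirrored on an earlier run
        try:
            os.symlink(src.resolve(), dst)
        except OSError:
            # Symlinks may be unavailable (for example without the needed
            # privilege on Windows), so fall back to a plain copy.
            shutil.copy2(src, dst)
        mirrored.append(src.name)
    return mirrored
```

Passing the set of compiled stage filenames as skip_names keeps the CPU-side ONNX files and tokenizer files reachable from the compiled directory while leaving the compiled stages to be written separately.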
Consolidate three scripts into a single unified CLI with sub-commands:

  qwen3.py export   -- full genai bundle build (transformer + embeddings
                       + lm_head), replaces export_qwen3_transformer_only.py
                       and export_qwen3_embeddings_lm_head.py
  qwen3.py infer    -- onnxruntime-genai streamed inference,
                       replaces infer_genai.py

Deleted:
  scripts/export_qwen3_embeddings_lm_head.py  (obsolete since #1008
    integrated embeddings/lm_head into the main export pipeline)
  scripts/export_qwen3_transformer_only.py
  scripts/infer_genai.py

Changes:
  - Default --device is now npu (was cpu) to match the primary use-case
  - Default --max-cache-len is now 2048 (aligns with reference bundle)
  - --output replaces --genai-bundle for clarity
  - --bundle replaces --model-dir in the infer sub-command
  - --compile in export triggers EPContext pre-compilation via GenaiSession
    context-manager (no private API access)
  - node summary covers both transformer (GQA/QDQ) and companion models
    (Gather/MatMulNBits)
Comment thread src/winml/modelkit/session/genai_session.py Outdated
Comment thread src/winml/modelkit/session/genai_session.py Outdated
Comment thread src/winml/modelkit/utils/genai.py
Comment thread scripts/qwen3.py Outdated
Comment thread scripts/qwen3.py Outdated
github-actions Bot added 6 commits July 3, 2026 09:01
Keep utils/genai.py execution-provider-agnostic: build_decoder_pipeline_stages and write_genai_bundle now take opaque context/iterator session_options supplied by the caller instead of an ep/soc_model pair, and qnn_stage_session_options is removed.

The QNN HTP session_options move into the Qwen3 module (models/hf/qwen3/genai.py), which wraps the generic builders so the emitted genai_config.json stays byte-identical to before.

Remove session/genai_session.py and its test (covered by a separate PR); session/__init__.py no longer exports the genai session symbols. scripts/qwen3.py becomes export-only (drop the infer subcommand, --compile, and the GenaiSession import).
…reference model)

Switch activation_type from uint16 to uint8 to align with the reference
qwen3-genai-share model (w8a8 QDQ, int8 weights + uint8 activations).

This keeps ctx.onnx / iter.onnx at opset 18 instead of opset 21.
ORT forces opset >= 21 for 16-bit QDQ (uint16), so the previous uint16
choice caused an automatic opset bump to 21 that deviated from the
reference graph layout.

Update test name and assertion accordingly.

Revert the uint8 change. uint16 activations give better generation quality
at the cost of opset 21 (required by ORT for 16-bit QDQ). This is the
correct precision for the QNN NPU pipeline.

Add strip_node_attrs() to winml.modelkit.onnx, a generic utility that
removes all attributes from matching op nodes except those listed in a
keep_attrs set. Operates in-place on an onnx.ModelProto; safe for models
with external data (modifies only the graph proto, not weight files).

Wire it into write_genai_bundle() via a new transformer_onnx_passes
parameter: a list of callables applied to ctx.onnx / iter.onnx after
they are copied into the bundle directory.

In scripts/qwen3.py, pass _strip_gqa_default_attrs (which retains only
do_rotary / kv_num_heads / num_heads) to remove the five extra attrs
that PyTorch's TorchScript ONNX exporter injects from the ORT
com.microsoft::GroupQueryAttention schema:
  k_quant_type, local_window_size, qk_output, smooth_softmax, v_quant_type

These are all no-op defaults and are absent from the reference model;
stripping them brings our bundle's GQA attribute set in line with the
reference.

8 new unit tests cover: extra-attr removal, keep-all, remove-all,
domain mismatch no-op, multi-node graphs, and identity (same object
returned).

WinMLQwen3Attention.forward was calling rotary_emb with
torch.arange(config.max_position_embeddings) = 40960 positions,
producing a 40960x64 cos/sin cache constant in every exported ONNX.
The reference model uses a 4096x64 cache (= max_cache_len).

Fix: use total_seq_len.item() (which equals max_cache_len at trace time,
as set by _TransformerOnlySeqLenGenerator) instead of
config.max_position_embeddings.  This produces a cache of exactly
max_cache_len rows — matching what will actually be needed at inference
time and 10x smaller for the default Qwen3-0.6B export.

Falls back to config.max_position_embeddings when total_seq_len is None
(e.g. eager evaluation outside the export path).

4 new tests verify rope cache sizing across multiple max_cache_len
values and the None fallback.

torch.export.export (used by torch.onnx.export at opset 18+) treats
int(total_seq_len.item()) as Sym(u0), causing torch.arange(Sym(u0))
to resolve to arange(1) at trace-time, baking cos_cache=[1,64] into
the ONNX graph instead of the intended [40960,64].

Fix: WinMLQwen3Attention.prepare_for_onnx_export now stores _max_rope_len
as a plain Python int (defaults to config.max_position_embeddings=40960).
forward() uses torch.arange(self._max_rope_len) -- a concrete int literal
that both the TorchScript and torch.export backends bake as a constant
tensor, giving cos_cache shape [40960, 64] which satisfies GQA's check
cos_cache.shape[0] >= total_seq_len for any total_seq_len <= 40960.

apply_transformer_only_export_prep accepts an optional max_rope_len kwarg
to allow callers to override the rope length (e.g., to match max_cache_len).

Tests updated: replaced total_seq_len-based assertions with _max_rope_len
attribute tests and forward() rope-length-independence tests.
Comment thread scripts/qwen3.py Fixed
Comment thread scripts/qwen3.py Fixed
Comment thread src/winml/modelkit/utils/genai.py Fixed
Comment thread src/winml/modelkit/utils/genai.py Fixed
Comment thread src/winml/modelkit/utils/genai.py Fixed
Comment thread src/winml/modelkit/utils/genai.py
Comment thread tests/unit/onnx/test_utils.py Fixed
github-actions Bot added 3 commits July 3, 2026 09:50
Remove the leftover 'from winml.modelkit.session import GenaiSession, GenerationConfig' import (those symbols were removed with genai_session.py, so the import would crash the export script) and the unused _SUPPORTED_EPS global. Both were flagged by CodeQL. Also correct the ctx/iter docstring from 'QNN-quantized' to 'QDQ-quantized' per review feedback.

Brings in winml perf --runtime winml-genai bundle benchmarking (#1015) and
drops stale pre-#836 leftovers from the PR diff by advancing the merge-base.

The Qwen3 write_genai_bundle wrapper did not accept or forward the transformer_onnx_passes argument used by the generic assembler and the export script (scripts/qwen3.py), so 'qwen3 export' crashed at bundle assembly with a TypeError. Add the parameter and forward it verbatim to winml.modelkit.utils.genai.write_genai_bundle. Add regression tests covering the pass-through and the ep-derived session_options forwarding.
@DingmaomaoBJTU DingmaomaoBJTU marked this pull request as ready for review July 3, 2026 04:11
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner July 3, 2026 04:11
github-actions Bot added 2 commits July 3, 2026 12:29
…t alerts

- build_genai_config: keep a valid pad_token_id of 0 instead of falling back
  to bos_token_id (the `... or bos` form treated the falsy 0 as unset).
- utils/genai.py: use a TYPE_CHECKING `from onnx import ModelProto` for the
  transformer_onnx_passes annotation so the module no longer carries a second
  `import onnx` (clears CodeQL py/repeated-import).
- quant/calibration/base.py: drop the redundant `...` after the Protocol
  method docstring (clears CodeQL py/ineffectual-statement).
- tests/unit/onnx/test_utils.py: import onnx symbols via a single `from onnx
  import ...` (clears CodeQL py/import-and-import-from).
- tests: add a regression test asserting pad_token_id==0 is preserved.
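
The pad_token_id fix in the first bullet comes down to a common Python pitfall: `or` falls through on any falsy value, including a valid token id of 0. A minimal sketch, with hypothetical function names and an arbitrary bos id of 7:

```python
from __future__ import annotations


def resolve_pad_buggy(pad_token_id: int | None, bos_token_id: int) -> int:
    """The `... or bos` form: 0 is falsy, so a valid pad id of 0 is lost."""
    return pad_token_id or bos_token_id


def resolve_pad(pad_token_id: int | None, bos_token_id: int) -> int:
    """Fall back to bos only when the pad id is actually unset."""
    return bos_token_id if pad_token_id is None else pad_token_id


buggy = resolve_pad_buggy(0, 7)
fixed = resolve_pad(0, 7)
unset = resolve_pad(None, 7)
```

Testing against None keeps the fallback for the unset case while preserving every integer id, which is what the regression test in the last bullet asserts for pad_token_id == 0.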
…y wrapper

Explain why QwenTransformerOnlyDecoderWrapper leaves max_rope_len at its
default: threading the build's max_cache_len down to this load-time hook needs
generic model-loader plumbing, which is deferred to the follow-up PR. The
apply_transformer_only_export_prep(..., max_rope_len=...) path is already
implemented and unit-tested.
Comment thread src/winml/modelkit/models/hf/qwen3/genai.py
Comment thread src/winml/modelkit/onnx/utils.py
@xieofxie

xieofxie commented Jul 3, 2026

Contributor

The PR description is stale and should be refreshed before merge. The "What's new" table references files that aren't in the actual diff: scripts/export_qwen3_transformer_only.py is actually deleted (not modified), scripts/infer_genai.py is not in this PR, and the --genai-bundle / --embeddings / --lm-head flags on the export script don't exist. The real change instead adds scripts/qwen3.py and src/winml/modelkit/utils/genai.py. The Usage and Notes sections also describe an inference script that isn't part of this PR. A reviewer relying on the description will be misled — please update it to match the actual file set.

@xieofxie

xieofxie commented Jul 3, 2026

Contributor

main() in scripts/qwen3.py ignores the parsed subcommand — it always hard-dispatches to _cmd_export(args) regardless of args.command. It works today with a single export subcommand, but the add_subparsers scaffolding implies more commands are coming, and the next one added would silently run export instead. Consider wiring dispatch via p.set_defaults(func=_cmd_export) (and return args.func(args)) so each subcommand routes to its own handler.

@DingmaomaoBJTU

Collaborator Author

main() in scripts/qwen3.py ignores the parsed subcommand — it always hard-dispatches to _cmd_export(args) regardless of args.command. It works today with a single export subcommand, but the add_subparsers scaffolding implies more commands are coming, and the next one added would silently run export instead. Consider wiring dispatch via p.set_defaults(func=_cmd_export) (and return args.func(args)) so each subcommand routes to its own handler.

Sure, fixed. By the way, I will remove this script in the next PR.

@DingmaomaoBJTU DingmaomaoBJTU changed the title feat(qwen3): genai bundle generation and inference script feat(qwen3): genai bundle generation Jul 3, 2026
main() hard-dispatched to _cmd_export regardless of the parsed
subcommand, so any future subcommand added to the add_subparsers
scaffold would silently run export. Register each subparser's handler
with set_defaults(func=...) and dispatch through args.func(args).
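
The dispatch pattern this commit describes looks roughly like the sketch below. The handler body and the single --output option are placeholders rather than the real scripts/qwen3.py; only the set_defaults(func=...) wiring and the args.func(args) call are the point.

```python
import argparse


def _cmd_export(args: argparse.Namespace) -> str:
    """Placeholder handler; the real export logic is not reproduced here."""
    return f"export:{args.output}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="qwen3.py")
    # required=True makes a missing subcommand an error instead of a silent default.
    subparsers = parser.add_subparsers(dest="command", required=True)

    export_parser = subparsers.add_parser("export", help="build a genai bundle")
    export_parser.add_argument("--output", required=True)
    # Each subparser registers its own handler.
    export_parser.set_defaults(func=_cmd_export)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Dispatch through the handler the chosen subparser registered.
    return args.func(args)
```

For example, main(["export", "--output", "out/bundle"]) reaches _cmd_export, while main([]) exits with a usage error. Adding another subcommand then only requires a new add_parser call with its own set_defaults(func=...).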
@DingmaomaoBJTU

Collaborator Author

Refreshed the PR title + description to match the actual diff (thanks for the catch @xieofxie). Dropped the stale bits — scripts/export_qwen3_transformer_only.py is deleted, and there's no scripts/infer_genai.py / --genai-bundle flag / inference script in this PR. The What's new table, Usage, and Notes now reflect the real files: scripts/qwen3.py (the unified export CLI), the generic src/winml/modelkit/utils/genai.py core, and the Qwen3/QNN src/winml/modelkit/models/hf/qwen3/genai.py layer. Also noted that inference is handled by the winml-genai runtime on main, not a script in this PR.

The main() dispatch fix is pushed in 272126e (p.set_defaults(func=_cmd_export) + return args.func(args)), so each subcommand routes to its own handler and a missing subcommand errors out.

@DingmaomaoBJTU DingmaomaoBJTU merged commit 7004125 into main Jul 3, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the pr/836/feature/qwen3-quant branch July 3, 2026 08:09
3 participants