feat(qwen3): genai bundle generation by DingmaomaoBJTU · Pull Request #996 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-29T08:01:17Z

Summary

Adds onnxruntime-genai bundle generation for winml-exported Qwen3 transformer-only models. A single scripts/qwen3.py export command builds all four bundle components (transformer ctx/iter, embeddings, lm_head) and assembles them into an onnxruntime-genai directory (genai_config.json + HF tokenizer files). The bundle-assembly machinery is split into an EP-agnostic core and a Qwen3/QNN-specific layer.

Inference over the assembled bundle is handled by the winml-genai runtime (winml perf --runtime winml-genai, already on main); this PR is generation-only. A dedicated genai inference session lands in a follow-up.

What's new

| File | Change |
| --- | --- |
| `src/winml/modelkit/utils/genai.py` | New generic, EP-agnostic core: `PipelineStage`, `DecoderIOMapping`, `build_genai_config()`, `build_decoder_pipeline_stages()`, `write_genai_bundle()` |
| `src/winml/modelkit/models/hf/qwen3/genai.py` | New Qwen3 + QNN layer: `build_qwen3_transformer_only_stages()`, `write_genai_bundle()` wrapper, `qnn_stage_session_options()` (QNN HTP routing) |
| `scripts/qwen3.py` | New unified export CLI (`export` subcommand) |
| `scripts/export_qwen3_transformer_only.py`, `scripts/export_qwen3_embeddings_lm_head.py` | Deleted; superseded by `scripts/qwen3.py` |
| `src/winml/modelkit/onnx/utils.py` | Adds `strip_node_attrs()` (drops exporter-injected default GQA attributes) |
| `src/winml/modelkit/models/hf/qwen3/qwen3_modeling.py` | `max_rope_len` export hook (prepared; defaults to current behavior) |
| `tests/unit/models/qwen3/test_genai_config.py`, `test_qwen3_modeling.py`, `tests/unit/onnx/test_utils.py` | New unit tests |

Design

utils/genai.py is architecture-agnostic. build_genai_config takes a list[PipelineStage] and a DecoderIOMapping — no tensor names or EP details are hardcoded. build_decoder_pipeline_stages introspects the built ctx.onnx/iter.onnx (_introspect_onnx_io + _detect_format_patterns) to discover past_keys_%d / present_values_%d-style patterns from the actual graph I/O, so tensor names can never drift from what the ONNX really contains.

models/hf/qwen3/genai.py is the Qwen3/QNN-specific layer. build_qwen3_transformer_only_stages wires the four Qwen3 stages, and qnn_stage_session_options emits the QNN HTP session_options for the transformer stages — all QNN specifics live here, not in the generic core. Its write_genai_bundle wraps the generic assembler so the one-shot API is unchanged.

build_genai_config(hf_config, ..., pipeline=stages, decoder_io=decoder_io)   <- generic (utils/genai.py)
build_qwen3_transformer_only_stages(ctx_onnx, iter_onnx, num_layers)          <- Qwen3/QNN (models/hf/qwen3/genai.py)

Usage

# Build all four components + assemble the genai bundle (transformer on NPU/QNN):
uv run python scripts/qwen3.py export --model-id Qwen/Qwen3-0.6B --device npu --output out/bundle

# Reuse pre-built companion ONNX files instead of rebuilding them:
uv run python scripts/qwen3.py export --model-id Qwen/Qwen3-0.6B \
  --embeddings path/to/embeddings.onnx --lm-head path/to/lm_head.onnx --output out/bundle

# Benchmark / run the assembled bundle (runtime already on main):
uv run winml perf -m out/bundle --runtime winml-genai

Notes

- embeddings.onnx (fp32) and lm_head.onnx (w4a32 MatMulNBits) are built by the script on CPU; --embeddings / --lm-head override them with pre-built files.
- Transformer ctx/iter stages route to the QNN HTP (NPU) via per-stage session_options in genai_config.json; embeddings/lm_head stay on CPU.
- Verified end-to-end: winml perf -m out/bundle --runtime winml-genai installs the WinML QNN EP, runs the transformer stages on the NPU, and generates correctly.

Follows from #836.

- src/winml/modelkit/models/hf/qwen3/genai.py: new module with build_genai_config() and write_genai_bundle(). build_genai_config generates the onnxruntime-genai pipeline config JSON from a HF PretrainedConfig + max_cache_len + prefill_seq_len. write_genai_bundle copies the winml-built ctx/iter ONNX, optional placeholder embeddings and lm_head ONNX, saves tokenizer files from HF, and writes genai_config.json.
- scripts/export_qwen3_transformer_only.py: add --genai-bundle DIR, --embeddings ONNX, --lm-head ONNX flags. When --genai-bundle is set, write_genai_bundle is called after the build to emit a complete onnxruntime-genai bundle.
- scripts/infer_genai.py: new inference script. Loads the genai bundle with og.Config, registers WinML EPs (QNN), and runs greedy generation via og.Generator. Supports --ep cpu|qnn, --chat template wrapping, --max-new, --context-length, --verbose.
- src/winml/modelkit/models/hf/qwen3/__init__.py: export build_genai_config and write_genai_bundle.
- tests/unit/models/qwen3/test_genai_config.py: 21 unit tests for build_genai_config covering pipeline structure, KV name counts, tensor name constants, edge cases (list eos_token_id, missing head_dim, None pad_token_id, custom filenames, variable layer count).

…tion

Replace hardcoded tensor-name constants with a data-driven design:

- PipelineStage dataclass: carries name, filename, run_on_prompt/token_gen, inputs, outputs, is_lm_head. Callers construct stages explicitly; no tensor names are baked into build_genai_config itself.
- DecoderIOMapping dataclass: holds the %d-style format strings that genai uses to expand per-layer KV tensor names. Defaults match Qwen3 naming but any naming convention is supported.
- build_genai_config: now takes pipeline: list[PipelineStage] and decoder_io: DecoderIOMapping. Architecture-agnostic; no Qwen3-specific logic. prefill_seq_len=None omits the sliding_window section.
- _introspect_onnx_io: reads graph.input / graph.output from an ONNX model without loading external data weights.
- _detect_format_patterns: scans tensor names for indexed groups matching <prefix><int> with exactly num_layers consecutive zero-based indices, returns {prefix: 'prefix%d'} patterns.
- build_qwen3_transformer_only_stages: Qwen3-specific factory that calls _introspect_onnx_io on the built ctx/iter ONNX, detects KV patterns via _detect_format_patterns, and returns (list[PipelineStage], DecoderIOMapping). Tensor names can never drift from the actual ONNX graph I/O.
- write_genai_bundle: delegates to build_qwen3_transformer_only_stages instead of hardcoding names.

Tests (35 total, all pass):

- TestBuildGenaiConfig: +2 new cases (no sliding_window, custom DecoderIOMapping)
- TestDetectFormatPatterns: 6 new unit tests for the pattern detector
- TestBuildQwen3TransformerOnlyStages: 6 new tests using patched _introspect_onnx_io (no real ONNX files required)
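The pattern detection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in utils/genai.py: the function name `detect_format_patterns`, the regular expression, and the sample tensor names are assumptions made for the example; only the contract (prefix + integer, exactly num_layers zero-based indices, `{prefix: 'prefix%d'}` result) comes from the commit message.

```python
import re
from collections import defaultdict

# A name is "indexed" when it ends in digits; everything before them is the prefix.
_INDEXED_NAME = re.compile(r"^(?P<prefix>.*?)(?P<index>\d+)$")


def detect_format_patterns(names: list[str], num_layers: int) -> dict[str, str]:
    """Return {prefix: 'prefix%d'} for prefixes indexed exactly 0..num_layers-1."""
    groups: dict[str, set[int]] = defaultdict(set)
    for name in names:
        match = _INDEXED_NAME.match(name)
        if match:
            groups[match.group("prefix")].add(int(match.group("index")))

    expected = set(range(num_layers))
    # Keep only groups whose indices are exactly the zero-based layer range.
    return {
        prefix: f"{prefix}%d"
        for prefix, indices in groups.items()
        if indices == expected
    }


if __name__ == "__main__":
    # Hypothetical graph inputs for a 2-layer model.
    graph_inputs = [
        "input_hidden_states",
        "past_keys_0", "past_keys_1",
        "past_values_0", "past_values_1",
        "stray_5",  # indexed, but not a full 0..N-1 group, so it is ignored
    ]
    print(detect_format_patterns(graph_inputs, num_layers=2))
```

Because the patterns are derived from the names actually present in the graph, a renamed tensor shows up as a missing pattern rather than a silently wrong config.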

- GenaiSession drives og.Model + og.Generator lifecycle for autoregressive text generation; peer class to WinMLSession (not a subclass)
- GenerationConfig dataclass: temperature, top_p, top_k, max_new_tokens, repetition_penalty, do_sample
- Lazy onnxruntime_genai import via _import_og(): class importable without the package installed (raises GenaiNotInstalledError on first use)
- Reuses WinMLEPRegistry for EP discovery/registration (idempotent)
- EP support: cpu (clear_providers only), qnn, dml
- context_length read from genai_config.json; overridable at construction
- generate_streaming() yields decoded token strings; generator del'd in finally
- generate() returns joined string; auto-load on first call if not loaded
- 33 unit tests; all use patch.dict(sys.modules) to avoid real hardware

- Moves chat template logic from infer_genai.py into GenaiSession
- Supports optional system prompt
- ChatML is not Qwen3-specific; used by Qwen2/3, Yi, Mistral, etc.
- infer_genai.py _wrap_chat_template now delegates to the static method
- Updated --chat flag help text and script docstring
- 4 new tests covering user-only, with-system, no-system-turn, assistant-priming

- PipelineStage gains session_options: dict | None = None field; PipelineStage.to_dict() emits it when set
- Add _qnn_stage_session_options(log_id, soc_model) helper that produces QNN HTP provider_options for a pipeline stage
- build_qwen3_transformer_only_stages gains ep='cpu' and soc_model='60' params; when ep='qnn' the context and iterator stages receive QNN session_options, embeddings and lm_head stay on CPU (no session_options)
- write_genai_bundle threads ep/soc_model through
- export_qwen3_transformer_only.py passes ep='qnn' when --device npu
- 5 new tests covering cpu/qnn ep routing and soc_model propagation (39 total, all pass)
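To make the per-stage routing concrete, here is a hedged sketch of a stage dataclass with an optional session_options field. The field names follow the commit messages; the exact JSON layout returned by `to_dict()`, the `backend_path` value, the optimization-mode value, and the tensor names are assumptions for illustration and may differ from the real module.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class PipelineStage:
    """One entry of the genai pipeline (layout is an assumption, see lead-in)."""

    name: str
    filename: str
    inputs: list[str]
    outputs: list[str]
    run_on_prompt: bool = True
    run_on_token_gen: bool = True
    is_lm_head: bool = False
    session_options: dict[str, Any] | None = None

    def to_dict(self) -> dict[str, Any]:
        entry: dict[str, Any] = {
            "filename": self.filename,
            "inputs": list(self.inputs),
            "outputs": list(self.outputs),
            "run_on_prompt": self.run_on_prompt,
            "run_on_token_gen": self.run_on_token_gen,
        }
        if self.is_lm_head:
            entry["is_lm_head"] = True
        # Emitted only when set, so CPU stages carry no session_options key.
        if self.session_options is not None:
            entry["session_options"] = self.session_options
        return {self.name: entry}


def qnn_stage_session_options(log_id: str, soc_model: str) -> dict[str, Any]:
    """Illustrative QNN HTP options; the values here are placeholders."""
    return {
        "log_id": log_id,
        "provider_options": [
            {
                "qnn": {
                    "backend_path": "QnnHtp.dll",  # assumed value
                    "soc_model": soc_model,
                    "htp_graph_finalization_optimization_mode": "3",
                }
            }
        ],
    }


if __name__ == "__main__":
    ctx = PipelineStage(
        name="context",
        filename="ctx.onnx",
        inputs=["input_hidden_states"],    # hypothetical tensor names
        outputs=["output_hidden_states"],
        run_on_token_gen=False,
        session_options=qnn_stage_session_options("ctx", "60"),
    )
    print(ctx.to_dict())
```

Keeping session_options opaque to the stage type is what later lets the generic core stay EP-agnostic while the Qwen3 layer supplies the QNN specifics.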

Remove clear_providers/append_provider calls from GenaiSession.load(). EP placement is fully driven by per-stage session_options in genai_config.json. clear_providers() only clears the top-level provider and cannot override per-stage session_options embedded in the pipeline config.

- Add 'mixed' EP (use genai_config.json as-is; default for infer_genai.py)
- _NEEDS_WINML_EPS covers mixed/qnn/dml to trigger EP registration
- Replace _EP_PROVIDER_MAP with _VALID_EPS + _NEEDS_WINML_EPS sets
- Update tests: remove append_provider assertions, add mixed/config-not-modified tests
- infer_genai.py default EP changed from 'cpu' to 'mixed'

Result: NPU bundle (out/qwen3_bundle_npu) now runs at 9.3 tok/s vs 1.2 tok/s CPU

- GenaiSession gains compile=True parameter
- _prepare_compiled_bundle(): detects QNN stages from genai_config.json, compiles each stage to EPContext ONNX via ort.ModelCompiler in a subprocess
- _compile_stage(): 5-minute timeout per stage to handle QNN SDK hang (known bug: w8a16 + multi-token prefill hangs indefinitely)
- Compiled artifacts cached in bundle_dir/_compiled/; reused on subsequent runs
- _mirror_non_onnx_files(): symlinks/copies tokenizer files so og.Config can load from the compiled sub-directory
- infer_genai.py --compile flag wired through to GenaiSession

…on_optimization_mode=0

Root cause: QNN SDK ModelCompiler deadlocks when compiling w8a16 quantized ONNX with multi-token static input shapes (seq_len > 1) at graph finalization optimization levels 1-3. The genai_config uses level 3 for runtime inference, which triggers the hang when passed to ModelCompiler directly.

Fix: _compile_stage now forces htp_graph_finalization_optimization_mode=0 for compilation. This lets ModelCompiler finish (ctx ~41s, iter ~67s) while runtime inference still uses the full level-3 optimization from genai_config (EPContext loading bypasses compilation entirely, so the runtime option is irrelevant).

Also fixes:

- Pipeline stage detection: genai_config uses 'qnn' key (not 'QNNExecutionProvider') in provider_options; detection and option extraction now uses the correct key
- _patch_stage_filename: genai_config pipeline is a list, not a dict; updated to iterate list entries correctly
- _prepare_compiled_bundle: passes QNN provider options from each stage's session_options to _compile_stage so soc_model, backend_path, etc. are respected
- Removed the 'prefill fallback to JIT' warning since the hang is now fixed

… spawn Windows multiprocessing spawn serialises the subprocess target via pickle. Local functions (closures) defined inside a method cannot be pickled, which caused 'AttributeError: Can't pickle local function' at runtime. Moved the compilation logic to a module-level function _qnn_compile_worker so it is importable by name in the spawned subprocess. Also fix ONNX filename in compiled genai_config: use ctx_onnx.name (just the filename) instead of str(ctx_onnx) (absolute path). ort-genai resolves filenames relative to the directory passed to og.Config, so an absolute path causes double-path concatenation and a 'file not found' error.
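The pickling constraint behind this fix is easy to reproduce with the standard library alone. The sketch below is illustrative: `make_local_worker` and `can_pickle` are hypothetical names, and nothing here touches the real `_qnn_compile_worker`.

```python
import pickle


def make_local_worker():
    """Return a closure, similar to a helper defined inside a method."""

    def worker(path: str) -> str:
        return path.upper()

    return worker


def can_pickle(obj) -> bool:
    """True if obj survives pickle.dumps, which spawn needs for the target."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Local functions fail here: pickle stores functions by qualified
        # name, and '<locals>' names cannot be looked up on re-import.
        return False


if __name__ == "__main__":
    print("local function picklable:", can_pickle(make_local_worker()))
    print("importable function picklable:", can_pickle(len))
```

Functions are pickled by reference (module plus qualified name), so only a function that the child process can import by name is usable as a spawn target.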

… stages

Previously _compile_stage forced mode='0' for ALL stages to avoid a QNN SDK deadlock on w8a16 + multi-token prefill. This also silently capped the iter (generation) stage at mode 0, producing under-optimized kernels (~10 tok/s).

Fix: only force mode=0 for prefill stages (run_on_prompt=true, seq_len>1 where the deadlock occurs). Generation stages (run_on_token_gen=true, seq_len=1) use the configured mode from genai_config.json (typically '3'), which is safe for single-token input and produces fully-optimized kernels.

Performance:

- Before: 10.4 tok/s (both ctx+iter compiled with mode 0)
- After: 43.4 tok/s (ctx mode 0, iter mode 3), matches reference ~45 tok/s

_prepare_compiled_bundle now passes is_prefill flag per stage based on run_on_prompt / run_on_token_gen fields in genai_config.json pipeline config.

… _compile_stage The original mode=0 override was added to avoid a QNN SDK deadlock when compiling w8a16 prefill (seq_len>1) at higher optimization levels. Testing revealed the deadlock only occurs when QNN provider options are NOT passed to ort.ModelCompiler at all (causing it to fall back to a broken default path). With correct QNN options (backend_path, soc_model, etc.) forwarded, mode=3 compiles successfully for both ctx (~73s) and iter (~67s) with no hang. Remove the is_prefill flag and mode override entirely. _compile_stage now passes genai_config QNN options unchanged, giving fully-optimized kernels for all stages. Performance (hot NPU, EPContext loaded): ctx+iter both mode=3: ~44.5 tok/s vs reference ~45 tok/s

…i as shim

- Extract all architecture-agnostic logic (PipelineStage, DecoderIOMapping, build_genai_config, build_decoder_pipeline_stages, write_genai_bundle, qnn_stage_session_options, ONNX introspection helpers) into src/winml/modelkit/utils/genai.py so other model families can reuse it
- Reduce qwen3/genai.py to a thin re-export shim with a backward-compatible build_qwen3_transformer_only_stages alias for existing callers
- fix(codeql): remove unused _TOKENIZER_FILES from utils/genai.py
- fix(codeql): remove unnecessary del generator in GenaiSession.generate_streaming
- fix(codeql): add missing Protocol body ellipsis in QuantConfigFinalizer.finalize
- fix(codeql): import get_quant_finalizer directly in quant/__init__.py
- fix(test): update mock patch path to winml.modelkit.utils.genai._introspect_onnx_io
- fix(test): replace bare 'import onnx' with 'from onnx import ...' in test_qwen3_calibration.py

- fix(_mirror_non_onnx_files): skip .onnx/.data files to avoid duplicating multi-GB model weights into _compiled/ on first --compile run
- fix(generate_streaming): restore try/finally around og.Generator so the KV cache buffer is freed immediately on early caller exit (GeneratorExit), not deferred until GC
- fix(build_genai_config): preserve eos_token_id list unchanged: ORT genai accepts a JSON array; truncating to [0] silently discards secondary stop tokens (e.g. Qwen3's [151645, 151643])
- fix(build_decoder_pipeline_stages): use name-based KV pattern matching ('key'/'val' in prefix) instead of purely positional, so models that list past_values before past_keys in their ONNX graph don't get a silent swap
- fix(qwen3/genai __all__): remove private _detect_format_patterns from __all__; tests now import it directly from winml.modelkit.utils.genai
- test: update test_eos_token_id_list_preserved to expect full list
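The name-based KV matching fix can be illustrated with a small helper. This is a sketch under assumptions: `pick_kv_patterns` is a hypothetical name, and it is meant to be applied separately to the detected input patterns and the detected output patterns; the real function may structure this differently.

```python
def pick_kv_patterns(patterns: dict[str, str]) -> tuple[str, str]:
    """Select (key_pattern, value_pattern) by prefix name, not by position."""
    keys = [fmt for prefix, fmt in patterns.items() if "key" in prefix.lower()]
    values = [fmt for prefix, fmt in patterns.items() if "val" in prefix.lower()]
    if len(keys) != 1 or len(values) != 1:
        raise ValueError(f"ambiguous or missing KV patterns: {sorted(patterns)}")
    return keys[0], values[0]


if __name__ == "__main__":
    # Values listed before keys, as some graphs order their inputs.
    detected = {"past_values_": "past_values_%d", "past_keys_": "past_keys_%d"}
    print(pick_kv_patterns(detected))
```

A positional rule (first pattern is keys, second is values) would swap the two for the input above without any error, which is the failure mode the commit describes.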

… in perf print

- src/: convert all absolute winml.modelkit.* imports to relative
  - qwen3/genai.py: from ....utils.genai import
  - utils/genai.py: from ..onnx import copy_onnx_model
  - session/genai_session.py: from .ep_registry / from ..winml (subprocess worker)
- genai_session: patch failed-stage filename to absolute src path so ort-genai can resolve it when loading from compiled_dir (was crashing)
- infer_genai.py: guard n/dt with max(dt,1e-9) to avoid ZeroDivisionError
- tests: import GenaiSession symbols from package __init__ not submodule

… model types

Now that qwen3_embeddings_only and qwen3_lm_head_only are available (merged from main via PR #1008), remove the placeholder pattern from the genai bundle assembly:

- export_qwen3_transformer_only.py: when --genai-bundle is set, automatically build embeddings (fp32) and lm_head (w4a32/MatMulNBits) via WinMLAutoModel if --embeddings / --lm-head override paths are not provided
- --embeddings / --lm-head flags are kept as optional override paths for callers that want to supply a pre-built ONNX instead of building from model_id
- Both companion models are built on CPU (task=feature-extraction, no_compile) since they run on CPU in the genai pipeline
- Drop the now-stale WARNING messages about missing embeddings/lm_head

…verride; fix shape_config

- utils/genai.py: add _patch_seq_dim_dynamic helper; apply it to both embeddings and lm_head ONNX after copy in write_genai_bundle. ort-genai calls these models with prompt_len tokens on prefill and seq_len=1 on each decode step, so the seq_len dimension must be symbolic not fixed
- session/genai_session.py: revert _prepare_cpu_bundle and the ep==cpu hook (GenaiSession uses genai_config.json as-is; cpu override not supported)
- export_qwen3_transformer_only.py: remove shape_config from companion build call; embeddings/lm_head have dynamic seq_len, no static shape needed

_mirror_non_onnx_files previously skipped ALL .onnx/.data files, which meant embeddings.onnx and lm_head.onnx were inaccessible when ort-genai loaded from _compiled/. Now only the QNN-compiled stage files (those in qnn_stages) are excluded; CPU-side ONNX files are symlinked into the compiled bundle directory so ort-genai can find them. Also pass compiled_onnx_names from _prepare_compiled_bundle to the mirror helper so the skip set is driven by what was actually compiled. Verified: --compile produces valid EPContext for ctx/iter stages, embeddings.onnx and lm_head.onnx are symlinked, inference runs at ~37 tok/s on Snapdragon X Elite NPU.
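The mirroring rule described here (skip only what was compiled, link or copy everything else) might look roughly like the sketch below. `mirror_bundle_files` and its signature are hypothetical; the symlink-then-copy fallback is an assumption about how "symlinks/copies" is handled, not a description of the actual helper.

```python
import os
import shutil
import tempfile
from pathlib import Path


def mirror_bundle_files(src_dir: Path, dst_dir: Path, skip_names: set[str]) -> list[str]:
    """Link (or copy) every file in src_dir into dst_dir except skip_names."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    mirrored: list[str] = []
    for src in sorted(src_dir.iterdir()):
        # Sub-directories (including dst_dir itself) and compiled stages are skipped.
        if not src.is_file() or src.name in skip_names:
            continue
        dst = dst_dir / src.name
        if dst.exists() or dst.is_symlink():
            continue
        try:
            os.symlink(src.resolve(), dst)
        except OSError:
            # Symlinks may need extra privileges (e.g. on Windows); fall back to a copy.
            shutil.copy2(src, dst)
        mirrored.append(src.name)
    return mirrored


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp)
        for name in ("ctx.onnx", "embeddings.onnx", "tokenizer.json"):
            (bundle / name).write_text("placeholder")
        # Only ctx.onnx was "compiled" in this toy run, so only it is skipped.
        print(mirror_bundle_files(bundle, bundle / "_compiled", {"ctx.onnx"}))
```

Driving the skip set from the names that were actually compiled keeps CPU-side models such as embeddings.onnx and lm_head.onnx reachable from the compiled directory.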

Consolidate three scripts into a single unified CLI with sub-commands:

- qwen3.py export -- full genai bundle build (transformer + embeddings + lm_head), replaces export_qwen3_transformer_only.py and export_qwen3_embeddings_lm_head.py
- qwen3.py infer -- onnxruntime-genai streamed inference, replaces infer_genai.py

Deleted:

- scripts/export_qwen3_embeddings_lm_head.py (obsolete since #1008 integrated embeddings/lm_head into the main export pipeline)
- scripts/export_qwen3_transformer_only.py
- scripts/infer_genai.py

Changes:

- Default --device is now npu (was cpu) to match the primary use-case
- Default --max-cache-len is now 2048 (aligns with reference bundle)
- --output replaces --genai-bundle for clarity
- --bundle replaces --model-dir in the infer sub-command
- --compile in export triggers EPContext pre-compilation via GenaiSession context-manager (no private API access)
- node summary covers both transformer (GQA/QDQ) and companion models (Gather/MatMulNBits)

Keep utils/genai.py execution-provider-agnostic: build_decoder_pipeline_stages and write_genai_bundle now take opaque context/iterator session_options supplied by the caller instead of an ep/soc_model pair, and qnn_stage_session_options is removed. The QNN HTP session_options move into the Qwen3 module (models/hf/qwen3/genai.py), which wraps the generic builders so the emitted genai_config.json stays byte-identical to before. Remove session/genai_session.py and its test (covered by a separate PR); session/__init__.py no longer exports the genai session symbols. scripts/qwen3.py becomes export-only (drop the infer subcommand, --compile, and the GenaiSession import).

…reference model) Switch activation_type from uint16 to uint8 to align with the reference qwen3-genai-share model (w8a8 QDQ, int8 weights + uint8 activations). This keeps ctx.onnx / iter.onnx at opset 18 instead of opset 21. ORT forces opset >= 21 for 16-bit QDQ (uint16), so the previous uint16 choice caused an automatic opset bump to 21 that deviated from the reference graph layout. Update test name and assertion accordingly.

Revert the uint8 change. uint16 activations give better generation quality at the cost of opset 21 (required by ORT for 16-bit QDQ). This is the correct precision for the QNN NPU pipeline.

Add strip_node_attrs() to winml.modelkit.onnx, a generic utility that removes all attributes from matching op nodes except those listed in a keep_attrs set. Operates in-place on an onnx.ModelProto; safe for models with external data (modifies only the graph proto, not weight files).

Wire it into write_genai_bundle() via a new transformer_onnx_passes parameter: a list of callables applied to ctx.onnx / iter.onnx after they are copied into the bundle directory.

In scripts/qwen3.py, pass _strip_gqa_default_attrs (which retains only do_rotary / kv_num_heads / num_heads) to remove the five extra attrs that PyTorch's TorchScript ONNX exporter injects from the ORT com.microsoft::GroupQueryAttention schema: k_quant_type, local_window_size, qk_output, smooth_softmax, v_quant_type. These are all no-op defaults and are absent from the reference model; stripping them brings our bundle's GQA attribute set in line with the reference.

8 new unit tests cover: extra-attr removal, keep-all, remove-all, domain mismatch no-op, multi-node graphs, and identity (same object returned).

WinMLQwen3Attention.forward was calling rotary_emb with torch.arange(config.max_position_embeddings) = 40960 positions, producing a 40960x64 cos/sin cache constant in every exported ONNX. The reference model uses a 4096x64 cache (= max_cache_len). Fix: use total_seq_len.item() (which equals max_cache_len at trace time, as set by _TransformerOnlySeqLenGenerator) instead of config.max_position_embeddings. This produces a cache of exactly max_cache_len rows — matching what will actually be needed at inference time and 10x smaller for the default Qwen3-0.6B export. Falls back to config.max_position_embeddings when total_seq_len is None (e.g. eager evaluation outside the export path). 4 new tests verify rope cache sizing across multiple max_cache_len values and the None fallback.

torch.export.export (used by torch.onnx.export at opset 18+) treats int(total_seq_len.item()) as Sym(u0), causing torch.arange(Sym(u0)) to resolve to arange(1) at trace-time, baking cos_cache=[1,64] into the ONNX graph instead of the intended [40960,64]. Fix: WinMLQwen3Attention.prepare_for_onnx_export now stores _max_rope_len as a plain Python int (defaults to config.max_position_embeddings=40960). forward() uses torch.arange(self._max_rope_len) -- a concrete int literal that both the TorchScript and torch.export backends bake as a constant tensor, giving cos_cache shape [40960, 64] which satisfies GQA's check cos_cache.shape[0] >= total_seq_len for any total_seq_len <= 40960. apply_transformer_only_export_prep accepts an optional max_rope_len kwarg to allow callers to override the rope length (e.g., to match max_cache_len). Tests updated: replaced total_seq_len-based assertions with _max_rope_len attribute tests and forward() rope-length-independence tests.

Remove the leftover 'from winml.modelkit.session import GenaiSession, GenerationConfig' import (those symbols were removed with genai_session.py, so the import would crash the export script) and the unused _SUPPORTED_EPS global. Both were flagged by CodeQL. Also correct the ctx/iter docstring from 'QNN-quantized' to 'QDQ-quantized' per review feedback.

Brings in winml perf --runtime winml-genai bundle benchmarking (#1015) and drops stale pre-#836 leftovers from the PR diff by advancing the merge-base.

The Qwen3 write_genai_bundle wrapper did not accept or forward the transformer_onnx_passes argument used by the generic assembler and the export script (scripts/qwen3.py), so 'qwen3 export' crashed at bundle assembly with a TypeError. Add the parameter and forward it verbatim to winml.modelkit.utils.genai.write_genai_bundle. Add regression tests covering the pass-through and the ep-derived session_options forwarding.

…t alerts

- build_genai_config: keep a valid pad_token_id of 0 instead of falling back to bos_token_id (the `... or bos` form treated the falsy 0 as unset).
- utils/genai.py: use a TYPE_CHECKING `from onnx import ModelProto` for the transformer_onnx_passes annotation so the module no longer carries a second `import onnx` (clears CodeQL py/repeated-import).
- quant/calibration/base.py: drop the redundant `...` after the Protocol method docstring (clears CodeQL py/ineffectual-statement).
- tests/unit/onnx/test_utils.py: import onnx symbols via a single `from onnx import ...` (clears CodeQL py/import-and-import-from).
- tests: add a regression test asserting pad_token_id==0 is preserved.
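The pad_token_id bug is the usual falsy-zero pitfall with `or`. A minimal reproduction, with hypothetical function names and an example bos id:

```python
from typing import Optional


def resolve_pad_buggy(pad_token_id: Optional[int], bos_token_id: int) -> int:
    # `or` treats 0 as "unset", so a legitimate pad id of 0 is replaced.
    return pad_token_id or bos_token_id


def resolve_pad_fixed(pad_token_id: Optional[int], bos_token_id: int) -> int:
    # Only fall back when the value is actually missing.
    return bos_token_id if pad_token_id is None else pad_token_id


if __name__ == "__main__":
    bos = 151643  # example id, used here only for illustration
    print(resolve_pad_buggy(0, bos), resolve_pad_fixed(0, bos), resolve_pad_fixed(None, bos))
```

An explicit `is None` check is the conventional fix whenever 0 is a valid value.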

…y wrapper Explain why QwenTransformerOnlyDecoderWrapper leaves max_rope_len at its default: threading the build's max_cache_len down to this load-time hook needs generic model-loader plumbing, which is deferred to the follow-up PR. The apply_transformer_only_export_prep(..., max_rope_len=...) path is already implemented and unit-tested.

xieofxie · 2026-07-03T05:56:40Z

The PR description is stale and should be refreshed before merge. The "What's new" table references files that aren't in the actual diff: scripts/export_qwen3_transformer_only.py is actually deleted (not modified), scripts/infer_genai.py is not in this PR, and the --genai-bundle / --embeddings / --lm-head flags on the export script don't exist. The real change instead adds scripts/qwen3.py and src/winml/modelkit/utils/genai.py. The Usage and Notes sections also describe an inference script that isn't part of this PR. A reviewer relying on the description will be misled — please update it to match the actual file set.

xieofxie · 2026-07-03T05:56:50Z

main() in scripts/qwen3.py ignores the parsed subcommand — it always hard-dispatches to _cmd_export(args) regardless of args.command. It works today with a single export subcommand, but the add_subparsers scaffolding implies more commands are coming, and the next one added would silently run export instead. Consider wiring dispatch via p.set_defaults(func=_cmd_export) (and return args.func(args)) so each subcommand routes to its own handler.

DingmaomaoBJTU · 2026-07-03T07:25:11Z

> main() in scripts/qwen3.py ignores the parsed subcommand — it always hard-dispatches to _cmd_export(args) regardless of args.command. It works today with a single export subcommand, but the add_subparsers scaffolding implies more commands are coming, and the next one added would silently run export instead. Consider wiring dispatch via p.set_defaults(func=_cmd_export) (and return args.func(args)) so each subcommand routes to its own handler.

Sure~ Fixed. Btw in next pr I will remove this script~

main() hard-dispatched to _cmd_export regardless of the parsed subcommand, so any future subcommand added to the add_subparsers scaffold would silently run export. Register each subparser's handler with set_defaults(func=...) and dispatch through args.func(args).
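For reference, the dispatch pattern from the review can be sketched with argparse as below. This is not the contents of scripts/qwen3.py: the handler body is a stub and only two of the script's flags are reproduced.

```python
import argparse
from typing import Optional, Sequence


def _cmd_export(args: argparse.Namespace) -> int:
    # Stub handler; the real command builds and assembles the bundle.
    print(f"export {args.model_id} -> {args.output}")
    return 0


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="qwen3.py")
    # required=True makes a missing subcommand an error instead of a silent default.
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("export", help="build the genai bundle")
    p.add_argument("--model-id", required=True)
    p.add_argument("--output", required=True)
    # Each subparser registers its own handler.
    p.set_defaults(func=_cmd_export)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    return args.func(args)


if __name__ == "__main__":
    raise SystemExit(main())
```

With this shape, adding a second subcommand only requires another `add_parser` plus `set_defaults(func=...)`; `main()` itself does not change.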

DingmaomaoBJTU · 2026-07-03T07:29:54Z

Refreshed the PR title + description to match the actual diff (thanks for the catch @xieofxie). Dropped the stale bits — scripts/export_qwen3_transformer_only.py is deleted, and there's no scripts/infer_genai.py / --genai-bundle flag / inference script in this PR. The What's new table, Usage, and Notes now reflect the real files: scripts/qwen3.py (the unified export CLI), the generic src/winml/modelkit/utils/genai.py core, and the Qwen3/QNN src/winml/modelkit/models/hf/qwen3/genai.py layer. Also noted that inference is handled by the winml-genai runtime on main, not a script in this PR.

The main() dispatch fix is pushed in 272126e (p.set_defaults(func=_cmd_export) + return args.func(args)), so each subcommand routes to its own handler and a missing subcommand errors out.

github-advanced-security AI found potential problems Jun 29, 2026

Comment thread src/winml/modelkit/models/hf/qwen3/genai.py Fixed

github-advanced-security AI found potential problems Jun 29, 2026

Comment thread src/winml/modelkit/session/genai_session.py Fixed

github-advanced-security AI found potential problems Jun 29, 2026

Comment thread src/winml/modelkit/session/genai_session.py Fixed

github-actions Bot added 15 commits July 1, 2026 10:41

DingmaomaoBJTU force-pushed the pr/836/feature/qwen3-quant branch from 8364969 to f729f92 Compare July 1, 2026 02:45

github-advanced-security AI found potential problems Jul 1, 2026

Comment thread src/winml/modelkit/models/hf/qwen3/genai.py Fixed

Comment thread src/winml/modelkit/quant/calibration/base.py Fixed

Comment thread src/winml/modelkit/utils/genai.py

github-advanced-security AI found potential problems Jul 1, 2026

Comment thread src/winml/modelkit/utils/genai.py

github-actions Bot added 2 commits July 1, 2026 12:57

DingmaomaoBJTU commented Jul 1, 2026

Comment thread src/winml/modelkit/session/genai_session.py Outdated

DingmaomaoBJTU commented Jul 1, 2026

Comment thread src/winml/modelkit/session/genai_session.py Outdated

DingmaomaoBJTU commented Jul 1, 2026

Comment thread src/winml/modelkit/utils/genai.py

xieofxie reviewed Jul 1, 2026

Comment thread scripts/qwen3.py Outdated

xieofxie reviewed Jul 1, 2026

Comment thread scripts/qwen3.py Outdated

xieofxie mentioned this pull request Jul 1, 2026

feat: add an --exporter flag to specify which will we use to export the model #1010

Open

This was referenced Jul 1, 2026

feat: add genai inference to perf command #1011

Closed

feat: quantize: Composite model should have default precision for each sub model #1012

Open

feat: optimize / analyzer: Composite model could have default ep / device for each sub model #1013

Open

github-actions Bot added 6 commits July 3, 2026 09:01

revert(quant): restore w8a16 (uint16 activations) for transformer-only

776f328

Revert the uint8 change. uint16 activations give better generation quality at the cost of opset 21 (required by ORT for 16-bit QDQ). This is the correct precision for the QNN NPU pipeline.

github-advanced-security AI found potential problems Jul 3, 2026

github-actions Bot added 3 commits July 3, 2026 09:50

Merge origin/main into qwen3 genai bundle branch

0581106

Brings in winml perf --runtime winml-genai bundle benchmarking (#1015) and drops stale pre-#836 leftovers from the PR diff by advancing the merge-base.

DingmaomaoBJTU marked this pull request as ready for review July 3, 2026 04:11

DingmaomaoBJTU requested a review from a team as a code owner July 3, 2026 04:11

github-actions Bot added 2 commits July 3, 2026 12:29

xieofxie reviewed Jul 3, 2026

Comment thread src/winml/modelkit/models/hf/qwen3/genai.py

xieofxie reviewed Jul 3, 2026

Comment thread src/winml/modelkit/onnx/utils.py

DingmaomaoBJTU changed the title ~~feat(qwen3): genai bundle generation and inference script~~ feat(qwen3): genai bundle generation Jul 3, 2026

xieofxie approved these changes Jul 3, 2026

DingmaomaoBJTU merged commit 7004125 into main Jul 3, 2026
9 checks passed

DingmaomaoBJTU deleted the pr/836/feature/qwen3-quant branch July 3, 2026 08:09

3 participants
