feat(qwen3): genai bundle generation #996
- src/winml/modelkit/models/hf/qwen3/genai.py: new module with build_genai_config() and write_genai_bundle(). build_genai_config generates the onnxruntime-genai pipeline config JSON from a HF PretrainedConfig + max_cache_len + prefill_seq_len. write_genai_bundle copies the winml-built ctx/iter ONNX, optional placeholder embeddings and lm_head ONNX, saves tokenizer files from HF, and writes genai_config.json.
- scripts/export_qwen3_transformer_only.py: add --genai-bundle DIR, --embeddings ONNX, --lm-head ONNX flags. When --genai-bundle is set, write_genai_bundle is called after the build to emit a complete onnxruntime-genai bundle.
- scripts/infer_genai.py: new inference script. Loads the genai bundle with og.Config, registers WinML EPs (QNN), and runs greedy generation via og.Generator. Supports --ep cpu|qnn, --chat template wrapping, --max-new, --context-length, --verbose.
- src/winml/modelkit/models/hf/qwen3/__init__.py: export build_genai_config and write_genai_bundle.
- tests/unit/models/qwen3/test_genai_config.py: 21 unit tests for build_genai_config covering pipeline structure, KV name counts, tensor name constants, edge cases (list eos_token_id, missing head_dim, None pad_token_id, custom filenames, variable layer count).
…tion
Replace hardcoded tensor-name constants with a data-driven design:
- PipelineStage dataclass: carries name, filename, run_on_prompt/token_gen,
inputs, outputs, is_lm_head. Callers construct stages explicitly; no
tensor names are baked into build_genai_config itself.
- DecoderIOMapping dataclass: holds the %d-style format strings that genai
uses to expand per-layer KV tensor names. Defaults match Qwen3 naming
but any naming convention is supported.
- build_genai_config: now takes pipeline: list[PipelineStage] and
decoder_io: DecoderIOMapping. Architecture-agnostic; no Qwen3-specific
logic. prefill_seq_len=None omits the sliding_window section.
- _introspect_onnx_io: reads graph.input / graph.output from an ONNX
model without loading external data weights.
- _detect_format_patterns: scans tensor names for indexed groups matching
<prefix><int> with exactly num_layers consecutive zero-based indices,
returns {prefix: 'prefix%d'} patterns.
- build_qwen3_transformer_only_stages: Qwen3-specific factory that calls
_introspect_onnx_io on the built ctx/iter ONNX, detects KV patterns via
_detect_format_patterns, and returns (list[PipelineStage], DecoderIOMapping).
Tensor names can never drift from the actual ONNX graph I/O.
- write_genai_bundle: delegates to build_qwen3_transformer_only_stages
instead of hardcoding names.
Tests (35 total, all pass):
- TestBuildGenaiConfig: +2 new cases (no sliding_window, custom DecoderIOMapping)
- TestDetectFormatPatterns: 6 new unit tests for the pattern detector
- TestBuildQwen3TransformerOnlyStages: 6 new tests using patched
_introspect_onnx_io (no real ONNX files required)
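The pattern detection described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the function name `detect_format_patterns`, the regular expression, and the tensor names in the usage note are assumptions chosen to match the commit text (`<prefix><int>` groups with exactly `num_layers` consecutive zero-based indices).

```python
import re
from collections import defaultdict

# Assumed shape of an indexed tensor name: any prefix followed by a trailing integer.
_INDEXED_NAME = re.compile(r"^(?P<prefix>.*?)(?P<index>\d+)$")


def detect_format_patterns(names, num_layers):
    """Return {prefix: 'prefix%d'} for groups indexed exactly 0..num_layers-1."""
    groups = defaultdict(set)
    for name in names:
        match = _INDEXED_NAME.match(name)
        if match:
            groups[match.group("prefix")].add(int(match.group("index")))
    expected = set(range(num_layers))
    # Keep only groups whose indices are exactly the zero-based layer range.
    return {
        prefix: f"{prefix}%d"
        for prefix, indices in groups.items()
        if indices == expected
    }
```

For example, with two layers the names `past_keys_0`, `past_keys_1`, `past_values_0`, `past_values_1` would yield the patterns `past_keys_%d` and `past_values_%d`, while an unindexed name such as `input_ids` or an incomplete group would be ignored.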
- GenaiSession drives og.Model + og.Generator lifecycle for autoregressive text generation; peer class to WinMLSession (not a subclass)
- GenerationConfig dataclass: temperature, top_p, top_k, max_new_tokens, repetition_penalty, do_sample
- Lazy onnxruntime_genai import via _import_og(): class importable without the package installed (raises GenaiNotInstalledError on first use)
- Reuses WinMLEPRegistry for EP discovery/registration (idempotent)
- EP support: cpu (clear_providers only), qnn, dml
- context_length read from genai_config.json; overridable at construction
- generate_streaming() yields decoded token strings; generator del'd in finally
- generate() returns joined string; auto-load on first call if not loaded
- 33 unit tests; all use patch.dict(sys.modules) to avoid real hardware
- Moves chat template logic from infer_genai.py into GenaiSession
- Supports optional system prompt
- ChatML is not Qwen3-specific; used by Qwen2/3, Yi, Mistral, etc.
- infer_genai.py _wrap_chat_template now delegates to the static method
- Updated --chat flag help text and script docstring
- 4 new tests covering user-only, with-system, no-system-turn, assistant-priming
- PipelineStage gains session_options: dict | None = None field; PipelineStage.to_dict() emits it when set
- Add _qnn_stage_session_options(log_id, soc_model) helper that produces QNN HTP provider_options for a pipeline stage
- build_qwen3_transformer_only_stages gains ep='cpu' and soc_model='60' params; when ep='qnn' the context and iterator stages receive QNN session_options, embeddings and lm_head stay on CPU (no session_options)
- write_genai_bundle threads ep/soc_model through
- export_qwen3_transformer_only.py passes ep='qnn' when --device npu
- 5 new tests covering cpu/qnn ep routing and soc_model propagation (39 total, all pass)
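A rough sketch of how a stage with optional per-stage `session_options` might serialize is shown here. The field names follow the commit messages, but the exact JSON layout emitted by `to_dict()` (a single-key dict keyed by stage name) and the tensor names and option values in the usage note are assumptions, not the PR's actual output.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PipelineStage:
    """Hypothetical stand-in for the PR's PipelineStage dataclass."""

    name: str
    filename: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    run_on_prompt: bool = True
    run_on_token_gen: bool = True
    is_lm_head: bool = False
    session_options: Optional[dict] = None

    def to_dict(self) -> dict:
        body = {
            "filename": self.filename,
            "inputs": list(self.inputs),
            "outputs": list(self.outputs),
            "run_on_prompt": self.run_on_prompt,
            "run_on_token_gen": self.run_on_token_gen,
        }
        if self.is_lm_head:
            body["is_lm_head"] = True
        # Emit session_options only when set, so CPU stages carry no EP routing.
        if self.session_options is not None:
            body["session_options"] = self.session_options
        return {self.name: body}
```

A CPU stage such as `PipelineStage(name="embeddings", filename="embeddings.onnx")` would then serialize without a `session_options` key, while a stage constructed with something like `session_options={"provider_options": [{"qnn": {"soc_model": "60"}}]}` would carry its routing with it.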
Remove clear_providers/append_provider calls from GenaiSession.load(). EP placement is fully driven by per-stage session_options in genai_config.json. clear_providers() only clears the top-level provider and cannot override per-stage session_options embedded in the pipeline config.
- Add 'mixed' EP (use genai_config.json as-is; default for infer_genai.py)
- _NEEDS_WINML_EPS covers mixed/qnn/dml to trigger EP registration
- Replace _EP_PROVIDER_MAP with _VALID_EPS + _NEEDS_WINML_EPS sets
- Update tests: remove append_provider assertions, add mixed/config-not-modified tests
- infer_genai.py default EP changed from 'cpu' to 'mixed'
Result: NPU bundle (out/qwen3_bundle_npu) now runs at 9.3 tok/s vs 1.2 tok/s CPU
- GenaiSession gains compile=True parameter
- _prepare_compiled_bundle(): detects QNN stages from genai_config.json, compiles each stage to EPContext ONNX via ort.ModelCompiler in a subprocess
- _compile_stage(): 5-minute timeout per stage to handle QNN SDK hang (known bug: w8a16 + multi-token prefill hangs indefinitely)
- Compiled artifacts cached in bundle_dir/_compiled/; reused on subsequent runs
- _mirror_non_onnx_files(): symlinks/copies tokenizer files so og.Config can load from the compiled sub-directory
- infer_genai.py --compile flag wired through to GenaiSession
…on_optimization_mode=0
Root cause: QNN SDK ModelCompiler deadlocks when compiling w8a16 quantized ONNX with multi-token static input shapes (seq_len > 1) at graph finalization optimization levels 1-3. The genai_config uses level 3 for runtime inference, which triggers the hang when passed to ModelCompiler directly.
Fix: _compile_stage now forces htp_graph_finalization_optimization_mode=0 for compilation. This lets ModelCompiler finish (ctx ~41s, iter ~67s) while runtime inference still uses the full level-3 optimization from genai_config (EPContext loading bypasses compilation entirely, so the runtime option is irrelevant).
Also fixes:
- Pipeline stage detection: genai_config uses 'qnn' key (not 'QNNExecutionProvider') in provider_options; detection and option extraction now uses the correct key
- _patch_stage_filename: genai_config pipeline is a list, not a dict; updated to iterate list entries correctly
- _prepare_compiled_bundle: passes QNN provider options from each stage's session_options to _compile_stage so soc_model, backend_path, etc. are respected
- Removed the 'prefill fallback to JIT' warning since the hang is now fixed
… spawn
Windows multiprocessing spawn serialises the subprocess target via pickle. Local functions (closures) defined inside a method cannot be pickled, which caused 'AttributeError: Can't pickle local function' at runtime. Moved the compilation logic to a module-level function _qnn_compile_worker so it is importable by name in the spawned subprocess.
Also fix ONNX filename in compiled genai_config: use ctx_onnx.name (just the filename) instead of str(ctx_onnx) (absolute path). ort-genai resolves filenames relative to the directory passed to og.Config, so an absolute path causes double-path concatenation and a 'file not found' error.
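The pickling constraint behind this fix can be reproduced in isolation. The sketch below is illustrative only: `compile_worker` is a hypothetical stand-in for the PR's `_qnn_compile_worker`, and `is_picklable` is a helper invented for the example.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if pickle can serialise obj (as the spawn start method requires)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def compile_worker(model_path: str) -> str:
    """Module-level function: pickled by qualified name, so a spawned child can import it."""
    return f"compiled:{model_path}"


def make_local_worker():
    """Return a function defined inside another function; pickle cannot look it up by name."""

    def _worker(model_path: str) -> str:
        return f"compiled:{model_path}"

    return _worker
```

When this is saved as an importable module, `is_picklable(compile_worker)` should hold while `is_picklable(make_local_worker())` should not, which is why moving the worker to module level resolves the error.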
… stages
Previously _compile_stage forced mode='0' for ALL stages to avoid a QNN SDK deadlock on w8a16 + multi-token prefill. This also silently capped the iter (generation) stage at mode 0, producing under-optimized kernels (~10 tok/s).
Fix: only force mode=0 for prefill stages (run_on_prompt=true, seq_len>1 where the deadlock occurs). Generation stages (run_on_token_gen=true, seq_len=1) use the configured mode from genai_config.json (typically '3'), which is safe for single-token input and produces fully-optimized kernels.
Performance:
Before: 10.4 tok/s (both ctx+iter compiled with mode 0)
After: 43.4 tok/s (ctx mode 0, iter mode 3), matches reference ~45 tok/s
_prepare_compiled_bundle now passes is_prefill flag per stage based on run_on_prompt / run_on_token_gen fields in genai_config.json pipeline config.
… _compile_stage
The original mode=0 override was added to avoid a QNN SDK deadlock when compiling w8a16 prefill (seq_len>1) at higher optimization levels. Testing revealed the deadlock only occurs when QNN provider options are NOT passed to ort.ModelCompiler at all (causing it to fall back to a broken default path). With correct QNN options (backend_path, soc_model, etc.) forwarded, mode=3 compiles successfully for both ctx (~73s) and iter (~67s) with no hang.
Remove the is_prefill flag and mode override entirely. _compile_stage now passes genai_config QNN options unchanged, giving fully-optimized kernels for all stages.
Performance (hot NPU, EPContext loaded): ctx+iter both mode=3: ~44.5 tok/s vs reference ~45 tok/s
…i as shim
- Extract all architecture-agnostic logic (PipelineStage, DecoderIOMapping, build_genai_config, build_decoder_pipeline_stages, write_genai_bundle, qnn_stage_session_options, ONNX introspection helpers) into src/winml/modelkit/utils/genai.py so other model families can reuse it
- Reduce qwen3/genai.py to a thin re-export shim with a backward-compatible build_qwen3_transformer_only_stages alias for existing callers
- fix(codeql): remove unused _TOKENIZER_FILES from utils/genai.py
- fix(codeql): remove unnecessary del generator in GenaiSession.generate_streaming
- fix(codeql): add missing Protocol body ellipsis in QuantConfigFinalizer.finalize
- fix(codeql): import get_quant_finalizer directly in quant/__init__.py
- fix(test): update mock patch path to winml.modelkit.utils.genai._introspect_onnx_io
- fix(test): replace bare 'import onnx' with 'from onnx import ...' in test_qwen3_calibration.py
- fix(_mirror_non_onnx_files): skip .onnx/.data files to avoid duplicating
multi-GB model weights into _compiled/ on first --compile run
- fix(generate_streaming): restore try/finally around og.Generator so the
KV cache buffer is freed immediately on early caller exit (GeneratorExit),
not deferred until GC
- fix(build_genai_config): preserve eos_token_id list unchanged — ORT genai
accepts a JSON array; truncating to [0] silently discards secondary stop
tokens (e.g. Qwen3's [151645, 151643])
- fix(build_decoder_pipeline_stages): use name-based KV pattern matching
('key'/'val' in prefix) instead of purely positional, so models that list
past_values before past_keys in their ONNX graph don't get a silent swap
- fix(qwen3/genai __all__): remove private _detect_format_patterns from
__all__; tests now import it directly from winml.modelkit.utils.genai
- test: update test_eos_token_id_list_preserved to expect full list
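The generate_streaming fix above relies on a general property of Python generators: closing a suspended generator raises GeneratorExit at the yield, so a finally block runs immediately. A minimal sketch follows, where `FakeGenerator` and its `release()` method are hypothetical stand-ins for an object such as og.Generator that owns an expensive buffer.

```python
class FakeGenerator:
    """Hypothetical stand-in for a resource that holds a large buffer."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self.released = False

    def __iter__(self):
        return iter(self._tokens)

    def release(self):
        self.released = True


def stream(resource):
    """Yield tokens one by one; free the resource even if the caller stops early."""
    try:
        for token in resource:
            yield token
    finally:
        # Runs on normal exhaustion, on errors, and on GeneratorExit (early close).
        resource.release()
```

If a caller takes one token and then calls `close()` on the stream (or breaks out of a loop and drops the reference), the finally block releases the resource at that point rather than at some later garbage-collection pass.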
… in perf print
- src/: convert all absolute winml.modelkit.* imports to relative
  - qwen3/genai.py: from ....utils.genai import
  - utils/genai.py: from ..onnx import copy_onnx_model
  - session/genai_session.py: from .ep_registry / from ..winml (subprocess worker)
- genai_session: patch failed-stage filename to absolute src path so ort-genai can resolve it when loading from compiled_dir (was crashing)
- infer_genai.py: guard n/dt with max(dt,1e-9) to avoid ZeroDivisionError
- tests: import GenaiSession symbols from package __init__ not submodule
… model types
Now that qwen3_embeddings_only and qwen3_lm_head_only are available (merged from main via PR #1008), remove the placeholder pattern from the genai bundle assembly:
- export_qwen3_transformer_only.py: when --genai-bundle is set, automatically build embeddings (fp32) and lm_head (w4a32/MatMulNBits) via WinMLAutoModel if --embeddings / --lm-head override paths are not provided
- --embeddings / --lm-head flags are kept as optional override paths for callers that want to supply a pre-built ONNX instead of building from model_id
- Both companion models are built on CPU (task=feature-extraction, no_compile) since they run on CPU in the genai pipeline
- Drop the now-stale WARNING messages about missing embeddings/lm_head
8364969 to f729f92
…verride; fix shape_config
- utils/genai.py: add _patch_seq_dim_dynamic helper; apply it to both embeddings and lm_head ONNX after copy in write_genai_bundle; ort-genai calls these models with prompt_len tokens on prefill and seq_len=1 on each decode step, so the seq_len dimension must be symbolic not fixed
- session/genai_session.py: revert _prepare_cpu_bundle and the ep==cpu hook (GenaiSession uses genai_config.json as-is; cpu override not supported)
- export_qwen3_transformer_only.py: remove shape_config from companion build call; embeddings/lm_head have dynamic seq_len, no static shape needed
_mirror_non_onnx_files previously skipped ALL .onnx/.data files, which meant embeddings.onnx and lm_head.onnx were inaccessible when ort-genai loaded from _compiled/. Now only the QNN-compiled stage files (those in qnn_stages) are excluded; CPU-side ONNX files are symlinked into the compiled bundle directory so ort-genai can find them.
Also pass compiled_onnx_names from _prepare_compiled_bundle to the mirror helper so the skip set is driven by what was actually compiled.
Verified: --compile produces valid EPContext for ctx/iter stages, embeddings.onnx and lm_head.onnx are symlinked, inference runs at ~37 tok/s on Snapdragon X Elite NPU.
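The mirroring behaviour described here (link everything into the compiled directory except the stage files that were actually compiled) could look roughly like the sketch below. The function name `mirror_files`, its signature, and the copy fallback are assumptions for illustration, not the PR's helper.

```python
import os
import shutil
import tempfile
from pathlib import Path


def mirror_files(src_dir: Path, dst_dir: Path, skip_names: set[str]) -> list[str]:
    """Symlink (or copy) each regular file in src_dir into dst_dir, except skip_names."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    mirrored = []
    for src in sorted(src_dir.iterdir()):
        # Sub-directories (including dst_dir itself) and skipped names are left out.
        if not src.is_file() or src.name in skip_names:
            continue
        dst = dst_dir / src.name
        if dst.exists() or dst.is_symlink():
            continue  # already mirrored on a previous run
        try:
            os.symlink(src.resolve(), dst)
        except OSError:
            # Symlinks may need extra privileges (e.g. on Windows); fall back to a copy.
            shutil.copy2(src, dst)
        mirrored.append(src.name)
    return mirrored
```

Driving `skip_names` from the set of compiled stage filenames, rather than from a file extension, is what keeps CPU-side models such as embeddings.onnx visible in the compiled directory.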
Consolidate three scripts into a single unified CLI with sub-commands:
qwen3.py export -- full genai bundle build (transformer + embeddings
+ lm_head), replaces export_qwen3_transformer_only.py
and export_qwen3_embeddings_lm_head.py
qwen3.py infer -- onnxruntime-genai streamed inference,
replaces infer_genai.py
Deleted:
scripts/export_qwen3_embeddings_lm_head.py (obsolete since #1008
integrated embeddings/lm_head into the main export pipeline)
scripts/export_qwen3_transformer_only.py
scripts/infer_genai.py
Changes:
- Default --device is now npu (was cpu) to match the primary use-case
- Default --max-cache-len is now 2048 (aligns with reference bundle)
- --output replaces --genai-bundle for clarity
- --bundle replaces --model-dir in the infer sub-command
- --compile in export triggers EPContext pre-compilation via GenaiSession
context-manager (no private API access)
- node summary covers both transformer (GQA/QDQ) and companion models
(Gather/MatMulNBits)
Keep utils/genai.py execution-provider-agnostic: build_decoder_pipeline_stages and write_genai_bundle now take opaque context/iterator session_options supplied by the caller instead of an ep/soc_model pair, and qnn_stage_session_options is removed. The QNN HTP session_options move into the Qwen3 module (models/hf/qwen3/genai.py), which wraps the generic builders so the emitted genai_config.json stays byte-identical to before.
Remove session/genai_session.py and its test (covered by a separate PR); session/__init__.py no longer exports the genai session symbols.
scripts/qwen3.py becomes export-only (drop the infer subcommand, --compile, and the GenaiSession import).
…reference model)
Switch activation_type from uint16 to uint8 to align with the reference qwen3-genai-share model (w8a8 QDQ, int8 weights + uint8 activations). This keeps ctx.onnx / iter.onnx at opset 18 instead of opset 21. ORT forces opset >= 21 for 16-bit QDQ (uint16), so the previous uint16 choice caused an automatic opset bump to 21 that deviated from the reference graph layout. Update test name and assertion accordingly.
Revert the uint8 change. uint16 activations give better generation quality at the cost of opset 21 (required by ORT for 16-bit QDQ). This is the correct precision for the QNN NPU pipeline.
Add strip_node_attrs() to winml.modelkit.onnx, a generic utility that removes all attributes from matching op nodes except those listed in a keep_attrs set. Operates in-place on an onnx.ModelProto; safe for models with external data (modifies only the graph proto, not weight files).
Wire it into write_genai_bundle() via a new transformer_onnx_passes parameter: a list of callables applied to ctx.onnx / iter.onnx after they are copied into the bundle directory.
In scripts/qwen3.py, pass _strip_gqa_default_attrs (which retains only do_rotary / kv_num_heads / num_heads) to remove the five extra attrs that PyTorch's TorchScript ONNX exporter injects from the ORT com.microsoft::GroupQueryAttention schema: k_quant_type, local_window_size, qk_output, smooth_softmax, v_quant_type. These are all no-op defaults and are absent from the reference model; stripping them brings our bundle's GQA attribute set in line with the reference.
8 new unit tests cover: extra-attr removal, keep-all, remove-all, domain mismatch no-op, multi-node graphs, and identity (same object returned).
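To make the attribute-stripping pass concrete, here is a duck-typed sketch. The signature is an assumption (the PR's real `strip_node_attrs` may differ), and `make_node` is a hypothetical test helper that builds plain-Python stand-ins instead of real ONNX protos. The loop only uses `op_type`, `domain`, `attribute`, and `remove()`, which an `onnx.ModelProto` graph is expected to support as well.

```python
from types import SimpleNamespace


def strip_node_attrs(model, op_type, keep_attrs, domain=""):
    """Remove every attribute not in keep_attrs from nodes matching op_type and domain.

    Mutates the model in place and returns the same object.
    """
    for node in model.graph.node:
        if node.op_type != op_type or node.domain != domain:
            continue
        # Iterate over a snapshot so removal does not disturb the loop.
        for attr in list(node.attribute):
            if attr.name not in keep_attrs:
                node.attribute.remove(attr)
    return model


def make_node(op_type, domain, attr_names):
    """Hypothetical stand-in for an ONNX NodeProto, for illustration only."""
    return SimpleNamespace(
        op_type=op_type,
        domain=domain,
        attribute=[SimpleNamespace(name=n) for n in attr_names],
    )
```

Applied with `keep_attrs={"do_rotary", "kv_num_heads", "num_heads"}` and `domain="com.microsoft"`, a GroupQueryAttention node would keep only those three attributes, and nodes in other domains would be left untouched.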
WinMLQwen3Attention.forward was calling rotary_emb with torch.arange(config.max_position_embeddings) = 40960 positions, producing a 40960x64 cos/sin cache constant in every exported ONNX. The reference model uses a 4096x64 cache (= max_cache_len). Fix: use total_seq_len.item() (which equals max_cache_len at trace time, as set by _TransformerOnlySeqLenGenerator) instead of config.max_position_embeddings. This produces a cache of exactly max_cache_len rows — matching what will actually be needed at inference time and 10x smaller for the default Qwen3-0.6B export. Falls back to config.max_position_embeddings when total_seq_len is None (e.g. eager evaluation outside the export path). 4 new tests verify rope cache sizing across multiple max_cache_len values and the None fallback.
torch.export.export (used by torch.onnx.export at opset 18+) treats int(total_seq_len.item()) as Sym(u0), causing torch.arange(Sym(u0)) to resolve to arange(1) at trace-time, baking cos_cache=[1,64] into the ONNX graph instead of the intended [40960,64]. Fix: WinMLQwen3Attention.prepare_for_onnx_export now stores _max_rope_len as a plain Python int (defaults to config.max_position_embeddings=40960). forward() uses torch.arange(self._max_rope_len) -- a concrete int literal that both the TorchScript and torch.export backends bake as a constant tensor, giving cos_cache shape [40960, 64] which satisfies GQA's check cos_cache.shape[0] >= total_seq_len for any total_seq_len <= 40960. apply_transformer_only_export_prep accepts an optional max_rope_len kwarg to allow callers to override the rope length (e.g., to match max_cache_len). Tests updated: replaced total_seq_len-based assertions with _max_rope_len attribute tests and forward() rope-length-independence tests.
Remove the leftover 'from winml.modelkit.session import GenaiSession, GenerationConfig' import (those symbols were removed with genai_session.py, so the import would crash the export script) and the unused _SUPPORTED_EPS global. Both were flagged by CodeQL. Also correct the ctx/iter docstring from 'QNN-quantized' to 'QDQ-quantized' per review feedback.
The Qwen3 write_genai_bundle wrapper did not accept or forward the transformer_onnx_passes argument used by the generic assembler and the export script (scripts/qwen3.py), so 'qwen3 export' crashed at bundle assembly with a TypeError. Add the parameter and forward it verbatim to winml.modelkit.utils.genai.write_genai_bundle. Add regression tests covering the pass-through and the ep-derived session_options forwarding.
…t alerts
- build_genai_config: keep a valid pad_token_id of 0 instead of falling back to bos_token_id (the `... or bos` form treated the falsy 0 as unset).
- utils/genai.py: use a TYPE_CHECKING `from onnx import ModelProto` for the transformer_onnx_passes annotation so the module no longer carries a second `import onnx` (clears CodeQL py/repeated-import).
- quant/calibration/base.py: drop the redundant `...` after the Protocol method docstring (clears CodeQL py/ineffectual-statement).
- tests/unit/onnx/test_utils.py: import onnx symbols via a single `from onnx import ...` (clears CodeQL py/import-and-import-from).
- tests: add a regression test asserting pad_token_id==0 is preserved.
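The pad_token_id fix in this commit comes down to the difference between a truthiness check and an explicit None check. The two helper functions below are hypothetical and only illustrate that difference; they are not taken from build_genai_config.

```python
def resolve_pad_token_id_buggy(pad_token_id, bos_token_id):
    """Truthiness fallback: a legitimate pad_token_id of 0 is treated as unset."""
    return pad_token_id or bos_token_id


def resolve_pad_token_id(pad_token_id, bos_token_id):
    """Explicit None check: only a missing pad_token_id falls back to bos_token_id."""
    return bos_token_id if pad_token_id is None else pad_token_id
```

With `pad_token_id=0` and `bos_token_id=1`, the first form returns 1 and the second returns 0; with `pad_token_id=None`, both return 1.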
…y wrapper
Explain why QwenTransformerOnlyDecoderWrapper leaves max_rope_len at its default: threading the build's max_cache_len down to this load-time hook needs generic model-loader plumbing, which is deferred to the follow-up PR. The apply_transformer_only_export_prep(..., max_rope_len=...) path is already implemented and unit-tested.
The PR description is stale and should be refreshed before merge. The "What's new" table references files that aren't in the actual diff:
Sure, fixed. By the way, I will remove this script in the next PR.
main() hard-dispatched to _cmd_export regardless of the parsed subcommand, so any future subcommand added to the add_subparsers scaffold would silently run export. Register each subparser's handler with set_defaults(func=...) and dispatch through args.func(args).
Refreshed the PR title + description to match the actual diff (thanks for the catch @xieofxie). Dropped the stale bits: The
Summary
Adds `onnxruntime-genai` bundle generation for winml-exported Qwen3 transformer-only models. A single `scripts/qwen3.py export` command builds all four bundle components (transformer `ctx`/`iter`, `embeddings`, `lm_head`) and assembles them into an onnxruntime-genai directory (`genai_config.json` + HF tokenizer files). The bundle-assembly machinery is split into an EP-agnostic core and a Qwen3/QNN-specific layer.

Inference over the assembled bundle is handled by the `winml-genai` runtime (`winml perf --runtime winml-genai`, already on `main`); this PR is generation-only. A dedicated genai inference session lands in a follow-up.

What's new
| File | Contents |
| --- | --- |
| `src/winml/modelkit/utils/genai.py` | `PipelineStage`, `DecoderIOMapping`, `build_genai_config()`, `build_decoder_pipeline_stages()`, `write_genai_bundle()` |
| `src/winml/modelkit/models/hf/qwen3/genai.py` | `build_qwen3_transformer_only_stages()`, `write_genai_bundle()` wrapper, `qnn_stage_session_options()` (QNN HTP routing) |
| `scripts/qwen3.py` | (`export` subcommand) |
| `scripts/export_qwen3_transformer_only.py`, `scripts/export_qwen3_embeddings_lm_head.py` | replaced by `scripts/qwen3.py` |
| `src/winml/modelkit/onnx/utils.py` | `strip_node_attrs()` (drops exporter-injected default GQA attributes) |
| `src/winml/modelkit/models/hf/qwen3/qwen3_modeling.py` | `max_rope_len` export hook (prepared; defaults to current behavior) |
| `tests/unit/models/qwen3/test_genai_config.py`, `test_qwen3_modeling.py`, `tests/unit/onnx/test_utils.py` | unit tests |

Design
`utils/genai.py` is architecture-agnostic. `build_genai_config` takes a `list[PipelineStage]` and a `DecoderIOMapping`; no tensor names or EP details are hardcoded. `build_decoder_pipeline_stages` introspects the built `ctx.onnx`/`iter.onnx` (`_introspect_onnx_io` + `_detect_format_patterns`) to discover `past_keys_%d`/`present_values_%d`-style patterns from the actual graph I/O, so tensor names can never drift from what the ONNX really contains.

`models/hf/qwen3/genai.py` is the Qwen3/QNN-specific layer. `build_qwen3_transformer_only_stages` wires the four Qwen3 stages, and `qnn_stage_session_options` emits the QNN HTP `session_options` for the transformer stages; all QNN specifics live here, not in the generic core. Its `write_genai_bundle` wraps the generic assembler so the one-shot API is unchanged.

Usage

Notes
- `embeddings.onnx` (fp32) and `lm_head.onnx` (w4a32 `MatMulNBits`) are built by the script on CPU; `--embeddings` / `--lm-head` override them with pre-built files.
- `ctx`/`iter` stages route to the QNN HTP (NPU) via per-stage `session_options` in `genai_config.json`; embeddings/lm_head stay on CPU.
- `winml perf -m out/bundle --runtime winml-genai` installs the WinML QNN EP, runs the transformer stages on the NPU, and generates correctly.

Follows from #836.