feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag by DingmaomaoBJTU · Pull Request #872 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-11T03:04:44Z

Summary

Adds precision-driven quantization to winml quantize and winml build. The --precision flag auto-selects the appropriate quantization algorithm: FP16 conversion, RTN weight-only, or static QDQ — no need to manually specify --algorithm or --rtn-bits.

Resolves #867

Supported Commands

Command	`--precision` support	Notes
`winml build`	✅	Full pipeline: export → optimize → quantize (precision controls quantize stage)
`winml quantize`	✅	Standalone quantization on pre-exported ONNX
`winml config`	✅	Generates build config with precision-resolved quant settings
`winml perf`	✅	Inherits from build pipeline (precision passed through)
`winml eval`	✅	Inherits from build pipeline (precision passed through)
`winml export`	❌ Removed	Precision is a quantize-stage concern, not export
`winml optimize`	❌ Removed	Precision is a quantize-stage concern, not optimize

Precision → Algorithm Auto-Resolution

Precision	Algorithm	Calibration	Output
`fp16`	FP16 conversion	❌	Full-model FP32→FP16
`int4`	RTN weight-only	❌	MatMulNBits (4-bit packed weights)
`w4a16`	RTN weight-only	❌	Same as int4
`w4a8`	RTN weight-only	❌	4-bit weights, 8-bit activation spec
`int8`	Static QDQ	✅	QuantizeLinear/DequantizeLinear W8A8
`int16`	Static QDQ	✅	W16A16 QDQ
`w8a16`	Static QDQ	✅	Mixed W8A16 QDQ
`w8a8`	Static QDQ	✅	Same as int8
`auto`	Device-dependent	Varies	Auto-selects for target hardware

Key rule: 4-bit weight → RTN (no QDQ support for 4-bit), 8/16-bit → static QDQ.
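
The table above can be read as a plain lookup. The following is a minimal sketch of that reading, not the PR's code: `Resolution`, `PRECISION_TABLE`, and `resolve` are hypothetical names, and `auto` is left out because its result depends on the target device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolution:
    algorithm: str           # "fp16", "rtn", or "static"
    needs_calibration: bool  # True only for static QDQ


# One entry per concrete row of the table; "auto" is device-dependent and omitted.
PRECISION_TABLE = {
    "fp16": Resolution("fp16", False),
    "int4": Resolution("rtn", False),
    "w4a16": Resolution("rtn", False),
    "w4a8": Resolution("rtn", False),
    "int8": Resolution("static", True),
    "int16": Resolution("static", True),
    "w8a16": Resolution("static", True),
    "w8a8": Resolution("static", True),
}


def resolve(precision: str) -> Resolution:
    """Look up the algorithm for a precision string (hypothetical helper)."""
    try:
        return PRECISION_TABLE[precision.lower()]
    except KeyError:
        raise ValueError(f"invalid precision: {precision!r}") from None


if __name__ == "__main__":
    for name in ("fp16", "int4", "int8"):
        print(name, resolve(name))
```

Keeping the mapping in one table makes the key rule easy to audit: every 4-bit-weight entry resolves to RTN, and only the static QDQ entries ask for calibration data.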

Usage

# FP16 — no calibration, ~50% size reduction
winml quantize -m model.onnx --precision fp16
winml build -m facebook/convnext-tiny-224 -o out/ --precision fp16

# RTN 4-bit weight-only — no calibration, fast
winml quantize -m model.onnx --precision int4
winml build -m facebook/convnext-tiny-224 -o out/ --precision int4

# Static QDQ INT8 — requires calibration data
winml quantize -m model.onnx --precision int8 --samples 100
winml build -m facebook/convnext-tiny-224 -o out/ --precision int8

# Warnings for mismatched options
winml quantize -m model.onnx --precision fp16 --samples 50
# → Warning: --samples ignored — FP16 conversion does not use calibration data.
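
The mismatched-option warning can be sketched as a small pure function. This is an illustration under assumptions: `calibration_warnings` is a hypothetical name (the PR's own helper is `_warn_ignored_calibration_options`, whose body is not shown here), the flag list is taken from the commit message about calibration-related flags, and the message wording is illustrative.

```python
# Flags that only matter when calibration runs (list taken from the PR text).
CALIBRATION_FLAGS = ("--samples", "--method", "--weight-type", "--activation-type")


def calibration_warnings(path_label, needs_calibration, provided_flags):
    """Return one warning per calibration flag that the chosen path will ignore.

    path_label        -- human-readable name of the path, e.g. "FP16 conversion"
    needs_calibration -- True only for static QDQ
    provided_flags    -- flags the user actually passed on the command line
    """
    if needs_calibration:
        return []
    return [
        f"Warning: {flag} ignored: {path_label} does not use calibration data."
        for flag in CALIBRATION_FLAGS
        if flag in provided_flags
    ]


if __name__ == "__main__":
    for line in calibration_warnings("FP16 conversion", False, {"--samples"}):
        print(line)
```

Returning the messages instead of printing them keeps the check usable from both the CLI layer and the API layer, which the PR says both emit the warning.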

Design

Algorithm Selection Logic

precision → is_weight_only_precision()?
  ├─ Yes (int4, w4a16, w4a8) → RTN path (MatMulNBitsQuantizer)
  ├─ No, fp16?               → FP16 path (convert_float_to_float16)
  └─ No, quantized?          → QDQ path (static quantization + calibration)
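
A minimal sketch of this routing for `w{x}a{y}` strings, assuming a regex-based parser. `parse_bits` and `select_path` are hypothetical names, and the supported bit-width sets are assumptions: the PR only confirms that 4-bit weights were added and that `w4a4` and `w3a8` are rejected. `auto` and `fp32` are out of scope here.

```python
import re

# Assumed supported bit-widths (see the note above).
VALID_WEIGHT_BITS = {4, 8, 16}
VALID_ACTIVATION_BITS = {8, 16}

# Named presets expressed as (weight_bits, activation_bits), per the precision table.
NAMED = {"int4": (4, 16), "int8": (8, 8), "int16": (16, 16)}
_WXAY = re.compile(r"^w(\d+)a(\d+)$")


def parse_bits(precision: str) -> tuple[int, int]:
    """Return (weight_bits, activation_bits) or raise ValueError."""
    p = precision.lower()
    if p in NAMED:
        return NAMED[p]
    match = _WXAY.match(p)
    if match is None:
        raise ValueError(f"invalid precision: {precision!r}")
    weight_bits, activation_bits = int(match.group(1)), int(match.group(2))
    if weight_bits not in VALID_WEIGHT_BITS or activation_bits not in VALID_ACTIVATION_BITS:
        raise ValueError(f"unsupported bit-width combination: {precision!r}")
    return weight_bits, activation_bits


def select_path(precision: str) -> str:
    """Route a precision to "fp16", "rtn", or "static"."""
    if precision.lower() == "fp16":
        return "fp16"
    weight_bits, _ = parse_bits(precision)
    # 4-bit weights have no QDQ support, so they always take the RTN path.
    return "rtn" if weight_bits == 4 else "static"
```

The sketch checks `fp16` before the weight-only test, which is the reverse of the order in the diagram; the result is the same because the two conditions never overlap.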

RTN Configuration

--precision int4 automatically sets:

algorithm = "rtn"
rtn_bits = 4 (derived from precision)
rtn_block_size = 128 (default, tunable via future CLI flag)
rtn_symmetric = True (default)

The bit-width is always inferred from --precision, so --rtn-bits is never needed. The other RTN parameters keep their defaults until the tuning flags listed under TODO are exposed.
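
To show what these settings control, here is a minimal pure-Python sketch of block-wise symmetric round-to-nearest quantization. It is a simplified illustration, not the ONNX Runtime MatMulNBitsQuantizer implementation (which also packs codes into bytes); `RtnConfig`, `rtn_quantize`, and `rtn_dequantize` are hypothetical names whose defaults mirror the values listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtnConfig:
    bits: int = 4           # derived from the precision string
    block_size: int = 128   # weights sharing one scale
    symmetric: bool = True  # zero maps to code 0, no zero-point


def rtn_quantize(weights, cfg=RtnConfig()):
    """Quantize a flat list of floats; return (codes, scales), one scale per block."""
    if not cfg.symmetric:
        raise NotImplementedError("this sketch covers the symmetric case only")
    qmax = 2 ** (cfg.bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (cfg.bits - 1))    # -8 for 4 bits
    codes, scales = [], []
    for start in range(0, len(weights), cfg.block_size):
        block = weights[start:start + cfg.block_size]
        max_abs = max(abs(w) for w in block)
        # An all-zero block gets a dummy scale to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round each weight to the nearest code, then clamp to the signed range.
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def rtn_dequantize(codes, scales, cfg=RtnConfig()):
    """Map codes back to floats using the scale of the block each code belongs to."""
    return [c * scales[i // cfg.block_size] for i, c in enumerate(codes)]
```

No calibration data appears anywhere in the sketch: the scales come from the weights alone, which is why the RTN path can skip calibration.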

Validation

FP16/RTN/dynamic algorithms skip task/model_name validation (no calibration needed)
Only algorithm="static" requires calibration → requires task/model_name for HF builds
Invalid precisions (e.g., banana, w4a4) produce clear error messages
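
The calibration rule can be sketched as follows. This is an assumption-laden illustration rather than the code in config/build.py: `requires_calibration` and `validate_hf_quant` are hypothetical names, and the error strings are made up for the example.

```python
KNOWN_ALGORITHMS = {"static", "dynamic", "rtn"}


def requires_calibration(algorithm: str, fp16_only: bool = False) -> bool:
    """Only static QDQ calibrates; a pure FP16 conversion never does."""
    return algorithm == "static" and not fp16_only


def validate_hf_quant(algorithm, fp16_only=False, task=None, model_name=None):
    """Return a list of validation errors for a Hugging Face build (empty if valid)."""
    if algorithm not in KNOWN_ALGORITHMS:
        return [f"unknown algorithm: {algorithm!r}"]
    errors = []
    if requires_calibration(algorithm, fp16_only):
        # Calibration needs to know which dataset and preprocessing to use.
        if not task:
            errors.append("task is required for static quantization")
        if not model_name:
            errors.append("model_name is required for static quantization")
    return errors
```

For example, `validate_hf_quant("rtn")` returns an empty list, while `validate_hf_quant("static")` reports both missing fields.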

E2E Verified (ConvNeXt-Tiny-224)

Command	Result
`winml build --precision fp16`	✅ 109MB→54.6MB, 87s
`winml build --precision int4`	✅ RTN 4-bit, 87s
`winml quantize --precision fp16`	✅ 4.7s
`winml quantize --precision int4`	✅ 1.1s (RTN)
`winml quantize --precision int8`	✅ 46s, 761 QDQ nodes
`winml quantize --precision w4a16`	✅ Same as int4
`winml quantize --precision fp16 --samples 50`	✅ Warning printed
`winml quantize --precision int4 --samples 50`	✅ Warning printed
`winml quantize --precision banana`	✅ Error: invalid precision

Files Changed

Core

config/precision.py — Added int4 preset, is_weight_only_precision(), extract_weight_bits(), expanded _VALID_WEIGHT_BITS to include 4
config/build.py — resolve_quant_compile_config creates RTN config for weight-only; validation skips calibration requirements for RTN/FP16/dynamic
quant/quantizer.py — Three execution paths: FP16 fast path → RTN (MatMulNBitsQuantizer) → static QDQ
quant/config.py — algorithm field (static/dynamic/rtn), RTN params, FP16 params
commands/quantize.py — CLI routing: FP16 → RTN → QDQ, with calibration-ignored warnings
commands/build.py — _patch_device propagates algorithm/RTN fields; _run_quantize_stage has dedicated RTN path with StageLive output
optim/fp16.py — convert_to_fp16() with keep_io_types, already-FP16 skip, topo sort fix

Removed

--precision from export and optimize commands (quantize stage handles all precision work)

TODO (follow-up PRs)

Mixed precision (FP16 + QDQ): e.g., --precision int8 --fp16 to run QDQ quantization first, then convert remaining FP32 ops to FP16. The config infrastructure supports this (fp16=True, fp16_only=False) but CLI flags and build pipeline routing are not yet wired.
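
The flag combination mentioned in this item can be sketched as a stage planner. `QuantFlags` and `plan_stages` are hypothetical names; the field names follow the config fields named in the PR. The source only describes FP16 post-processing after QDQ, so restricting it to the static algorithm is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantFlags:
    algorithm: str = "static"  # "static", "dynamic", or "rtn"
    fp16: bool = False         # convert remaining FP32 ops after quantization
    fp16_only: bool = False    # skip quantization, convert to FP16 only


def plan_stages(cfg: QuantFlags) -> list[str]:
    """Return the ordered steps the quantize stage would run for these flags."""
    if cfg.fp16_only:
        return ["fp16"]  # pure FP16 fast path, no QDQ
    stages = [cfg.algorithm]
    # Assumption: FP16 post-processing is only applied after static QDQ.
    if cfg.fp16 and cfg.algorithm == "static":
        stages.append("fp16")
    return stages
```

Under these assumptions, `plan_stages(QuantFlags(fp16=True))` yields `["static", "fp16"]`, matching the "QDQ first, then FP16" order described above.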
RTN CLI tuning flags: Expose --rtn-block-size, --rtn-symmetric, --rtn-accuracy-level on winml quantize and winml build for advanced users.
Dynamic quantization path: Wire --algorithm dynamic through quantize command (config already supports algorithm="dynamic" but no CLI flag yet).

timenick

Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Add FP16 precision conversion support across all model pipeline commands:
- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)

- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)

Expand build's --precision from fp32/fp16 only to the full precision range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8, w8a16). This unifies the build and quantize CLI experience.

Changes:
- Update precision_option() to accept free-form string instead of click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error messages
- Add precision examples to build command help text

- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)

When --precision fp16 is used, calibration-related flags (--samples, --method, --weight-type, --activation-type) have no effect. Add explicit warnings in both the CLI layer (quantize command) and the API layer (quantize_onnx) so users are not silently surprised.

- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)

- QDQ FP16 post-processing: apply convert_to_fp16 in-memory instead of save-reload-save round-trip (matches RTN pattern)
- Pass use_external_data consistently to all save_onnx calls
- extract_weight_bits: validate bit-widths against supported sets
- Add test for unsupported bit-width combinations (w4a4, w3a8, etc.)
- Clarify dynamic algorithm as planned-not-yet-wired in config comment

DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd Compare June 11, 2026 04:15

DingmaomaoBJTU changed the title ~~feat: add --enable-fp16-conversion to winml optimize~~ feat: add --precision fp16 to optimize, build, and export commands Jun 11, 2026

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae Compare June 11, 2026 04:22

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread src/winml/modelkit/optim/fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab Compare June 11, 2026 04:32

timenick reviewed Jun 11, 2026

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread tests/unit/optim/test_fp16.py Outdated

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 Compare June 11, 2026 05:26

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c Compare June 11, 2026 07:43

DingmaomaoBJTU changed the title ~~feat: add --precision fp16 to optimize, build, and export commands~~ feat: FP16 precision support via quantize stage + extended build --precision Jun 23, 2026

DingmaomaoBJTU and others added 7 commits June 23, 2026 15:16

chore: remove spurious .data files

37f12a4

fix: resolve CodeQL import warnings in fp16 module

3b4e69f

Replace 'import onnx' + 'from onnx import ...' dual-import pattern with consistent 'from onnx import ...' style to satisfy CodeQL's 'Module is imported with import and import from' check.

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 Compare June 23, 2026 07:37

github-advanced-security AI found potential problems Jun 23, 2026

Comment thread src/winml/modelkit/commands/build.py Fixed

github-actions Bot added 6 commits June 23, 2026 15:43

fix: skip task/model_name validation for fp16_only quant configs

4597e07

FP16-only quantization configs do not perform calibration, so they do not need task or model_name fields. The validation now treats fp16_only the same as ONNX builds and submodule builds.

fix: skip calibration validation for rtn and dynamic algorithms

e882dd5

Only static QDQ quantization requires calibration data (and thus task/model_name). RTN (weight-only) and dynamic quantization do not need calibration, so they should not require these fields.

fix: build pipeline RTN routing and MatMulNBitsQuantizer model extrac…

762f2d0

…tion
- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper

fix: resolve lint warnings (raw regex strings, unused variable)

43d27b3

DingmaomaoBJTU changed the title ~~feat: FP16 precision support via quantize stage + extended build --precision~~ feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag Jun 23, 2026

github-actions Bot added 3 commits June 23, 2026 18:12

fix: resolve mypy type errors and remove duplicate imports

1183861

- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage

fix: address code review findings

b99fabf

- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication

DingmaomaoBJTU wants to merge 16 commits into main from dingmaomaobjtu/feat-fp16-conversion

3 participants
