feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872

Open
DingmaomaoBJTU wants to merge 16 commits into main from dingmaomaobjtu/feat-fp16-conversion

Conversation

DingmaomaoBJTU commented Jun 11, 2026

Collaborator

Summary

Adds precision-driven quantization to winml quantize and winml build. The --precision flag auto-selects the quantization algorithm (FP16 conversion, RTN weight-only, or static QDQ), so users no longer need to specify --algorithm or --rtn-bits manually.

Resolves #867

Supported Commands

| Command | --precision support | Notes |
|---|---|---|
| winml build | Supported | Full pipeline: export → optimize → quantize (precision controls quantize stage) |
| winml quantize | Supported | Standalone quantization on pre-exported ONNX |
| winml config | Supported | Generates build config with precision-resolved quant settings |
| winml perf | Supported | Inherits from build pipeline (precision passed through) |
| winml eval | Supported | Inherits from build pipeline (precision passed through) |
| winml export | ❌ Removed | Precision is a quantize-stage concern, not export |
| winml optimize | ❌ Removed | Precision is a quantize-stage concern, not optimize |

Precision → Algorithm Auto-Resolution

| Precision | Algorithm | Calibration | Output |
|---|---|---|---|
| fp16 | FP16 conversion | Not required | Full-model FP32→FP16 |
| int4 | RTN weight-only | Not required | MatMulNBits (4-bit packed weights) |
| w4a16 | RTN weight-only | Not required | Same as int4 |
| w4a8 | RTN weight-only | Not required | 4-bit weights, 8-bit activation spec |
| int8 | Static QDQ | Required | QuantizeLinear/DequantizeLinear W8A8 |
| int16 | Static QDQ | Required | W16A16 QDQ |
| w8a16 | Static QDQ | Required | Mixed W8A16 QDQ |
| w8a8 | Static QDQ | Required | Same as int8 |
| auto | Device-dependent | Varies | Auto-selects for target hardware |

Key rule: 4-bit weight → RTN (no QDQ support for 4-bit), 8/16-bit → static QDQ.
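The key rule can be sketched in a few lines of Python. This is an illustrative stand-in, not the code in config/precision.py: parse_precision and algorithm_for are hypothetical names, the set of accepted activation widths is an assumption, and fp32 and auto are left out.

```python
import re
from typing import Tuple

# Named presets from the table, as (weight_bits, activation_bits).
NAMED = {"int4": (4, 16), "int8": (8, 8), "int16": (16, 16)}
WEIGHT_BITS = {4, 8, 16}    # weight widths that appear in the table
ACTIVATION_BITS = {8, 16}   # assumption: only these activation widths are accepted


def parse_precision(precision: str) -> Tuple[int, int]:
    """Return (weight_bits, activation_bits) for a named or w{x}a{y} precision."""
    p = precision.strip().lower()
    if p in NAMED:
        return NAMED[p]
    match = re.fullmatch(r"w(\d+)a(\d+)", p)
    if match is None:
        raise ValueError(f"invalid precision: {precision!r}")
    weight_bits, activation_bits = int(match.group(1)), int(match.group(2))
    if weight_bits not in WEIGHT_BITS or activation_bits not in ACTIVATION_BITS:
        raise ValueError(f"unsupported bit-widths in precision: {precision!r}")
    return weight_bits, activation_bits


def algorithm_for(precision: str) -> str:
    """Apply the key rule: fp16 converts, 4-bit weights use RTN, the rest use static QDQ."""
    if precision.strip().lower() == "fp16":
        return "fp16"
    weight_bits, _ = parse_precision(precision)
    return "rtn" if weight_bits == 4 else "static"
```

Under these assumptions, algorithm_for("w4a8") returns "rtn", algorithm_for("int8") returns "static", and both "banana" and "w4a4" raise ValueError.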

Usage

# FP16 — no calibration, ~50% size reduction
winml quantize -m model.onnx --precision fp16
winml build -m facebook/convnext-tiny-224 -o out/ --precision fp16

# RTN 4-bit weight-only — no calibration, fast
winml quantize -m model.onnx --precision int4
winml build -m facebook/convnext-tiny-224 -o out/ --precision int4

# Static QDQ INT8 — requires calibration data
winml quantize -m model.onnx --precision int8 --samples 100
winml build -m facebook/convnext-tiny-224 -o out/ --precision int8

# Warnings for mismatched options
winml quantize -m model.onnx --precision fp16 --samples 50
# → Warning: --samples ignored — FP16 conversion does not use calibration data.
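The mismatched-option warning shown above could be produced by a small helper along these lines. This is a sketch only: ignored_option_warnings is a hypothetical name, the message wording is approximate, and the flag names are the calibration-related options mentioned in this PR (--samples, --method, --weight-type, --activation-type).

```python
from typing import Dict, List, Optional


def ignored_option_warnings(
    path_label: str,
    uses_calibration: bool,
    options: Dict[str, Optional[object]],
) -> List[str]:
    """Return one warning per calibration flag that was set but will have no effect."""
    if uses_calibration:
        # Static QDQ consumes these options, so there is nothing to warn about.
        return []
    return [
        f"Warning: {flag} ignored ({path_label} does not use calibration data)."
        for flag, value in options.items()
        if value is not None  # None means the user did not pass the flag
    ]


for line in ignored_option_warnings(
    "FP16 conversion", False, {"--samples": 50, "--method": None}
):
    print(line)
```

Collecting the messages in a list keeps the helper easy to test and lets the CLI layer and the API layer decide separately how to display them.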

Design

Algorithm Selection Logic

precision → is_weight_only_precision()?
  ├─ Yes (int4, w4a16, w4a8) → RTN path (MatMulNBitsQuantizer)
  ├─ No, fp16?         → FP16 path (convert_float_to_float16)
  └─ No, quantized?    → QDQ path (static quantization + calibration)
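A minimal sketch of this routing, with the three paths replaced by placeholder handlers so that the order of the checks is visible. select_path, run_quantize, and the handler strings are hypothetical; the real paths call MatMulNBitsQuantizer, convert_float_to_float16, and static quantization as described above.

```python
from typing import Callable, Dict

# 4-bit weight precisions listed in this PR.
WEIGHT_ONLY = {"int4", "w4a16", "w4a8"}


def select_path(precision: str) -> str:
    """Mirror the decision tree: weight-only first, then fp16, otherwise QDQ."""
    p = precision.strip().lower()
    if p in WEIGHT_ONLY:
        return "rtn"
    if p == "fp16":
        return "fp16"
    return "qdq"


def run_quantize(
    precision: str, model_path: str, handlers: Dict[str, Callable[[str], str]]
) -> str:
    """Dispatch to the handler registered for the selected path."""
    return handlers[select_path(precision)](model_path)


# Placeholder handlers; each real path would transform and save the model.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "rtn": lambda path: f"RTN weight-only quantization of {path}",
    "fp16": lambda path: f"FP16 conversion of {path}",
    "qdq": lambda path: f"static QDQ quantization of {path} (needs calibration)",
}

print(run_quantize("int4", "model.onnx", HANDLERS))
```

This sketch assumes the precision string has already been validated, so anything that is neither weight-only nor fp16 falls through to the QDQ path.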

RTN Configuration

--precision int4 automatically sets:

  • algorithm = "rtn"
  • rtn_bits = 4 (derived from precision)
  • rtn_block_size = 128 (default, tunable via future CLI flag)
  • rtn_symmetric = True (default)

Advanced users can still tune the other RTN parameters; --rtn-bits is never needed because the bit-width is always inferred from the precision.
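To make the block size and symmetric settings concrete, here is a pure-Python sketch of symmetric round-to-nearest quantization with one scale per block. It illustrates the general RTN technique only: rtn_quantize and rtn_dequantize are hypothetical names, and the 4-bit packing and storage layout used by MatMulNBits are not shown.

```python
from typing import List, Tuple

Block = Tuple[float, List[int]]  # (scale, quantized integers)


def rtn_quantize(weights: List[float], bits: int = 4, block_size: int = 128) -> List[Block]:
    """Quantize weights block by block with symmetric round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    qmin = -qmax - 1            # -8 for 4-bit
    blocks: List[Block] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # One scale per block; an all-zero block gets a dummy scale of 1.0.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized = [min(qmax, max(qmin, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks


def rtn_dequantize(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate weights from (scale, integers) blocks."""
    return [scale * q for scale, quantized in blocks for q in quantized]


blocks = rtn_quantize([1.0, -0.5, 0.25, 0.0, 2.0, -2.0], bits=4, block_size=4)
print(rtn_dequantize(blocks))
```

No calibration data is involved: each scale depends only on the weights in its block, which is why this path is fast. A smaller block size gives each scale fewer weights to cover, at the cost of storing more scales.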

Validation

  • FP16/RTN/dynamic algorithms skip task/model_name validation (no calibration needed)
  • Only algorithm="static" requires calibration → requires task/model_name for HF builds
  • Invalid precisions (e.g., banana, w4a4) produce clear error messages
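The calibration rule behind the first two bullets can be sketched as follows. QuantConfig is a hypothetical stand-in for the fields of WinMLQuantizationConfig mentioned in this PR, and validate_hf_build is an illustrative name, not the function in config/build.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantConfig:
    """Hypothetical subset of the quantization config fields."""
    algorithm: str = "static"        # "static", "dynamic", or "rtn"
    fp16_only: bool = False          # pure FP16 conversion, no QDQ
    task: Optional[str] = None
    model_name: Optional[str] = None


def needs_calibration(cfg: QuantConfig) -> bool:
    """Only static QDQ consumes calibration data."""
    return cfg.algorithm == "static" and not cfg.fp16_only


def validate_hf_build(cfg: QuantConfig) -> List[str]:
    """Return validation errors; task/model_name matter only when calibrating."""
    if not needs_calibration(cfg):
        return []
    return [
        f"{name} is required for static quantization (used to build calibration data)"
        for name in ("task", "model_name")
        if getattr(cfg, name) is None
    ]


print(validate_hf_build(QuantConfig(algorithm="rtn")))     # RTN: nothing required
print(validate_hf_build(QuantConfig(algorithm="static")))  # static: both fields missing
```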

E2E Verified (ConvNeXt-Tiny-224)

| Command | Result |
|---|---|
| winml build --precision fp16 | ✅ 109MB→54.6MB, 87s |
| winml build --precision int4 | ✅ RTN 4-bit, 87s |
| winml quantize --precision fp16 | ✅ 4.7s |
| winml quantize --precision int4 | ✅ 1.1s (RTN) |
| winml quantize --precision int8 | ✅ 46s, 761 QDQ nodes |
| winml quantize --precision w4a16 | ✅ Same as int4 |
| winml quantize --precision fp16 --samples 50 | ✅ Warning printed |
| winml quantize --precision int4 --samples 50 | ✅ Warning printed |
| winml quantize --precision banana | ✅ Error: invalid precision |
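The 109MB→54.6MB result for fp16 is consistent with each weight shrinking from 4 bytes to 2. The standard-library sketch below shows that arithmetic on a handful of made-up values; it is not how convert_to_fp16() operates on an ONNX graph.

```python
import struct
from typing import List


def pack_fp32(values: List[float]) -> bytes:
    """Serialize values as little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def pack_fp16(values: List[float]) -> bytes:
    """Serialize values as little-endian IEEE 754 half floats (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


weights = [0.5, -1.25, 3.0, 0.001]  # made-up sample values
fp32_bytes = pack_fp32(weights)
fp16_bytes = pack_fp16(weights)
print(len(fp32_bytes), len(fp16_bytes))  # 16 8

# Values come back rounded to the nearest representable half float.
roundtrip = struct.unpack(f"<{len(weights)}e", fp16_bytes)
print(roundtrip)
```

The halving applies to tensor data only, and values above the FP16 maximum of 65504 cannot be packed this way, so a real converter has to handle out-of-range values.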

Files Changed

Core

  • config/precision.py: added int4 preset, is_weight_only_precision(), extract_weight_bits(); expanded _VALID_WEIGHT_BITS to include 4
  • config/build.py: resolve_quant_compile_config creates RTN config for weight-only; validation skips calibration requirements for RTN/FP16/dynamic
  • quant/quantizer.py: three execution paths, FP16 fast path → RTN (MatMulNBitsQuantizer) → static QDQ
  • quant/config.py: algorithm field (static/dynamic/rtn), RTN params, FP16 params
  • commands/quantize.py: CLI routing (FP16 → RTN → QDQ), with calibration-ignored warnings
  • commands/build.py: _patch_device propagates algorithm/RTN fields; _run_quantize_stage has dedicated RTN path with StageLive output
  • optim/fp16.py: convert_to_fp16() with keep_io_types, already-FP16 skip, topo sort fix

Removed

  • --precision from export and optimize commands (quantize stage handles all precision work)

TODO (follow-up PRs)

  • Mixed precision (FP16 + QDQ): e.g., --precision int8 --fp16 to run QDQ quantization first, then convert remaining FP32 ops to FP16. The config infrastructure supports this (fp16=True, fp16_only=False) but CLI flags and build pipeline routing are not yet wired.
  • RTN CLI tuning flags: Expose --rtn-block-size, --rtn-symmetric, --rtn-accuracy-level on winml quantize and winml build for advanced users.
  • Dynamic quantization path: Wire --algorithm dynamic through quantize command (config already supports algorithm="dynamic" but no CLI flag yet).
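For the mixed-precision item, the two config flags can be read as selecting which steps the quantize stage runs. The sketch below encodes that reading; planned_steps and the step names are hypothetical, and it covers only the static QDQ case described in this PR.

```python
from typing import List


def planned_steps(fp16: bool, fp16_only: bool) -> List[str]:
    """Map the fp16/fp16_only flags to an ordered list of quantize-stage steps."""
    if fp16_only:
        # Pure FP16 fast path: QDQ is skipped entirely.
        return ["fp16_convert"]
    steps = ["static_qdq"]
    if fp16:
        # Mixed precision: convert the remaining FP32 ops after QDQ.
        steps.append("fp16_convert")
    return steps


print(planned_steps(fp16=True, fp16_only=True))    # --precision fp16
print(planned_steps(fp16=False, fp16_only=False))  # --precision int8
print(planned_steps(fp16=True, fp16_only=False))   # planned: --precision int8 --fp16
```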

DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04
Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd June 11, 2026 04:15
DingmaomaoBJTU changed the title from "feat: add --enable-fp16-conversion to winml optimize" to "feat: add --precision fp16 to optimize, build, and export commands" Jun 11, 2026
Comment thread tests/unit/optim/test_fp16.py Fixed
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae June 11, 2026 04:22
Comment thread src/winml/modelkit/optim/fp16.py Fixed
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab June 11, 2026 04:32

timenick left a comment

Collaborator

Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread src/winml/modelkit/commands/build.py Outdated
Comment thread tests/unit/optim/test_fp16.py Outdated
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 June 11, 2026 05:26
Comment thread tests/unit/optim/test_fp16.py Fixed
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c June 11, 2026 07:43
DingmaomaoBJTU changed the title from "feat: add --precision fp16 to optimize, build, and export commands" to "feat: FP16 precision support via quantize stage + extended build --precision" Jun 23, 2026
DingmaomaoBJTU and others added 7 commits June 23, 2026 15:16
Add FP16 precision conversion support across all model pipeline commands:

- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it
lives as a command-layer utility rather than an optimizer pipe. All three
commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list,
  and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ)
  and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile
  where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)

- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from
  optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only
  config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)

Expand build's --precision from fp32/fp16 only to the full precision
range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8,
w8a16). This unifies the build and quantize CLI experience.

Changes:
- Update precision_option() to accept free-form string instead of
  click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config
  resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device
  for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error
  messages
- Add precision examples to build command help text

Replace 'import onnx' + 'from onnx import ...' dual-import pattern
with consistent 'from onnx import ...' style to satisfy CodeQL's
'Module is imported with import and import from' check.

- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant
  config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)
DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 June 23, 2026 07:37
Comment thread src/winml/modelkit/commands/build.py Fixed
github-actions Bot added 6 commits June 23, 2026 15:43
When --precision fp16 is used, calibration-related flags (--samples,
--method, --weight-type, --activation-type) have no effect. Add
explicit warnings in both the CLI layer (quantize command) and the
API layer (quantize_onnx) so users are not silently surprised.

FP16-only quantization configs do not perform calibration, so they
do not need task or model_name fields. The validation now treats
fp16_only the same as ONNX builds and submodule builds.

Only static QDQ quantization requires calibration data (and thus
task/model_name). RTN (weight-only) and dynamic quantization do not
need calibration, so they should not require these fields.

- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)
…tion

- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper
DingmaomaoBJTU changed the title from "feat: FP16 precision support via quantize stage + extended build --precision" to "feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag" Jun 23, 2026
github-actions Bot added 3 commits June 23, 2026 18:12
- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage
- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication
- QDQ FP16 post-processing: apply convert_to_fp16 in-memory instead of
  save-reload-save round-trip (matches RTN pattern)
- Pass use_external_data consistently to all save_onnx calls
- extract_weight_bits: validate bit-widths against supported sets
- Add test for unsupported bit-width combinations (w4a4, w3a8, etc.)
- Clarify dynamic algorithm as planned-not-yet-wired in config comment
Development

Successfully merging this pull request may close these issues.

feat: add --enable-fp16-conversion to winml optimize and --precision to winml build/export

3 participants