perf: optimize winml sys startup (55s → 4s on Qualcomm) #266

Closed
timenick wants to merge 10 commits into main from zhiwang/optimize-winml-sys-perf

timenick commented Apr 8, 2026

Collaborator

Summary

Resolves #261

winml sys took ~55s on Snapdragon X Plus with no progress indicator, so the command appeared to hang.

Root causes identified and fixed:

  • Eager module loading: winml.modelkit.__init__.py and cli.py eagerly imported torch/transformers/optimum (~6s) even for lightweight commands like sys
  • Per-device PowerShell processes: PnpDevice.__init__ spawned a separate PowerShell for each NPU's Get-PnpDeviceProperty — extremely slow on Qualcomm ACPI devices
  • Serial PowerShell invocations: CPU, GPU, NPU, and OS queries each launched independent PowerShell processes (~2s cold start each)
  • No parallelism: Hardware probing and Python-side work (torch import, library scanning) ran sequentially
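The last root cause can be illustrated with a minimal sketch of overlapping a slow external probe with Python-side work. The function names and the `time.sleep` stand-ins below are hypothetical and are not the PR's code; they only show the general `ThreadPoolExecutor` pattern, which helps because a thread waiting on a subprocess or on `sleep` releases the GIL.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor


def probe_hardware() -> dict:
    """Hypothetical stand-in for a slow external probe (e.g. a PowerShell subprocess)."""
    time.sleep(0.2)  # simulated wait on the child process
    return {"cpu": "example-cpu", "npu": ["example-npu"]}


def scan_libraries() -> list[str]:
    """Hypothetical stand-in for Python-side work (heavy imports, version scans)."""
    time.sleep(0.2)
    return ["example-lib 1.0"]


def collect_sequential() -> tuple[dict, list[str]]:
    # Baseline: total time is roughly the sum of both steps.
    return probe_hardware(), scan_libraries()


def collect_overlapped() -> tuple[dict, list[str]]:
    # Start the probe on a worker thread, do the Python-side work on the
    # main thread while the probe waits, then join on the result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        hardware_future = pool.submit(probe_hardware)
        libraries = scan_libraries()
        hardware = hardware_future.result()
    return hardware, libraries


def timed(fn):
    """Return (result, elapsed_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential_s = timed(collect_sequential)
    _, overlapped_s = timed(collect_overlapped)
    print(f"sequential: {sequential_s:.2f}s, overlapped: {overlapped_s:.2f}s")
```

With two steps of similar duration, the overlapped version should take about as long as the slower step rather than the sum of both. The gain shrinks if the Python-side work is CPU-bound and holds the GIL for long stretches.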

Optimizations applied:

| Optimization | Impact |
| --- | --- |
| Lazy __init__.py for winml.modelkit and session packages | Skip torch/transformers/optimum import for sys command |
| Lazy CLI command discovery (_LazyGroup) | Only import the invoked subcommand module |
| platform.version() for Windows 11 detection | Eliminate OS.get() PowerShell call entirely |
| query_all_hardware() — single PowerShell process | Merge CIM + PnP + PnpDeviceProperty into one invocation |
| ThreadPoolExecutor parallelism | Overlap PowerShell probe with import torch / library scanning |
| Batched Get-PnpDeviceProperty -KeyName | Query only needed properties instead of all |
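The PR's `_is_windows_11()` implementation is not shown here, so the following is only a sketch of how `platform.version()` can stand in for a PowerShell OS query. It assumes the usual Windows version string shape `major.minor.build` (for example `10.0.22631`) and relies on the fact that Windows 11 still reports major version 10, with build numbers starting at 22000. The function name and its optional parameters are illustrative.

```python
from __future__ import annotations

import platform


def is_windows_11(system: str | None = None, version: str | None = None) -> bool:
    """Return True if the version string looks like Windows 11 (build >= 22000).

    Both parameters default to the live values from the platform module; they
    are exposed only so the check can be exercised on any OS.
    """
    system = platform.system() if system is None else system
    version = platform.version() if version is None else version
    if system != "Windows":
        return False
    parts = version.split(".")
    try:
        major, build = int(parts[0]), int(parts[2])
    except (IndexError, ValueError):
        # Unexpected format: report "not Windows 11" rather than guessing.
        return False
    # Windows 10 and 11 share major version 10; the build number separates them.
    return (major, build) >= (10, 22000)


if __name__ == "__main__":
    print(is_windows_11())
```

Because this reads only in-process data, it avoids the cold-start cost of launching a separate process for a single boolean.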

Benchmarks

| Command | Device | main | This PR | Speedup |
| --- | --- | --- | --- | --- |
| winml sys | Qualcomm ARM64 | 55s | 4.2s | 13.1x |
| winml sys --list-device | Qualcomm ARM64 | 54s | 4.3s | 12.6x |
| winml sys | Intel x64 | 11.0s | 2.7s | 4.1x |
| winml sys --list-device | Intel x64 | 10.7s | 2.5s | 4.3x |

Changed files

  • __init__.py (winml.modelkit) — __getattr__ lazy loading with cached resolution
  • cli.py — _LazyGroup for on-demand command module import
  • commands/sys.py — _is_windows_11(), query_all_hardware(), parallel execution
  • session/__init__.py — __getattr__ lazy loading to break circular import chain
  • sysinfo/helper.py — CimInstance.get_many_by_class_name(), PnpDevice batched properties, query_all_hardware()
  • sysinfo/hardware.py — NPU._EXTRA_PROPERTY_KEYS for targeted property fetch
  • onnx/detection.py — Move QDQ_OP_TYPES import to function level (break circular import)
  • tests/unit/sysinfo/test_sysinfo.py — Updated for _is_windows_11(), added edge case coverage
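As a rough illustration of the single-process idea behind the sysinfo/helper.py changes, the sketch below builds one PowerShell script that queries several CIM classes and returns everything as one JSON document, so the process startup cost is paid once. The helper names (`build_script`, `run_powershell`, `parse_result`) and the choice to select only the `Name` property are assumptions made for the example; the PR's actual `query_all_hardware()` is not reproduced here.

```python
from __future__ import annotations

import json
import subprocess


def build_script(cim_classes: list[str]) -> str:
    """Build one PowerShell script that queries every class and emits one JSON object."""
    entries = [
        # @(...) forces an array even when a class has a single instance.
        f"'{name}' = @(Get-CimInstance -ClassName {name} | Select-Object Name)"
        for name in cim_classes
    ]
    return "@{ " + "; ".join(entries) + " } | ConvertTo-Json -Depth 4 -Compress"


def run_powershell(script: str) -> str:
    """Run the script in a single PowerShell process (Windows only)."""
    completed = subprocess.run(
        ["powershell", "-NoProfile", "-NonInteractive", "-Command", script],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def parse_result(stdout: str) -> dict[str, list[dict]]:
    """Parse the JSON output, defensively normalizing each value to a list."""
    data = json.loads(stdout)
    normalized = {}
    for key, value in data.items():
        if value is None:
            normalized[key] = []
        elif isinstance(value, list):
            normalized[key] = value
        else:
            normalized[key] = [value]
    return normalized


if __name__ == "__main__":
    script = build_script(["Win32_Processor", "Win32_VideoController"])
    print(parse_result(run_powershell(script)))
```

The same shape extends to PnP queries: additional keys in the hashtable can hold the output of `Get-PnpDevice` or of `Get-PnpDeviceProperty` restricted with `-KeyName`, so only the needed properties are fetched.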

timenick added 2 commits April 8, 2026 11:23
- Lazy-load winml.modelkit and session __init__.py to skip torch/
  transformers/optimum import for lightweight commands
- Lazy CLI command discovery — only import the invoked subcommand
- Replace OS.get() WMI call with platform.version() for Win11 detection
- Merge all hardware queries (CIM + PnP + properties) into a single
  PowerShell process via query_all_hardware()
- Parallelize PowerShell hardware probe with Python-side work (torch
  import, library version scanning) using ThreadPoolExecutor
- Move QDQ_OP_TYPES import to function level to break onnx ↔ compiler
  circular import exposed by lazy loading
- Cache __getattr__ results via globals() to avoid repeated resolution

Benchmarks:
  Intel x64:       11s → 2.7s
  Qualcomm ARM64:  64s → 8.2s
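The lazy-loading and caching bullets in the commit message above describe a pattern based on module-level `__getattr__` (PEP 562). The sketch below is an assumption-laden illustration rather than the PR's code: it builds a throwaway module object so that it runs as a single file, and it uses standard-library targets (`fractions`, `json`) as stand-ins for heavy dependencies. In a real `__init__.py`, `__getattr__` would be defined at module top level and the cache write would be `globals()[name] = value`.

```python
from __future__ import annotations

import importlib
import types

# Public attribute name -> (module to import, attribute to fetch from it).
# Standard-library stand-ins for heavy, slow-to-import dependencies.
LAZY_ATTRS = {
    "Fraction": ("fractions", "Fraction"),
    "dumps": ("json", "dumps"),
}


def make_lazy_module(name: str, lazy_attrs: dict[str, tuple[str, str]]) -> types.ModuleType:
    """Create a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_name, attr_name = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_name), attr_name)
        # Cache in the module namespace so later lookups bypass __getattr__.
        module.__dict__[attr] = value
        return value

    module.__getattr__ = __getattr__
    return module


if __name__ == "__main__":
    lazy_pkg = make_lazy_module("lazy_pkg", LAZY_ATTRS)
    print("Fraction" in vars(lazy_pkg))  # not resolved yet
    print(lazy_pkg.Fraction(1, 2))       # triggers the import
    print("Fraction" in vars(lazy_pkg))  # now cached
```

Caching matters because `__getattr__` runs on every failed lookup; once the value is stored in the module namespace, ordinary attribute access finds it directly.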
Comment thread src/winml/modelkit/commands/sys.py Fixed
Comment thread src/winml/modelkit/__init__.py Fixed
Comment thread src/winml/modelkit/commands/sys.py Fixed
Comment thread src/winml/modelkit/commands/sys.py Fixed
timenick marked this pull request as ready for review April 8, 2026 08:39
timenick requested a review from a team as a code owner April 8, 2026 08:39
timenick requested a review from tezheng April 8, 2026 08:59
timenick deleted the zhiwang/optimize-winml-sys-perf branch April 13, 2026 06:18
Development

Successfully merging this pull request may close these issues.

winml sys takes ~90 seconds on first run with no progress indicator (appears hung)

3 participants