perf: optimize winml sys startup (55s → 4s on Qualcomm) #266
Closed
timenick wants to merge 10 commits into
- Lazy-load winml.modelkit and session __init__.py to skip the torch/transformers/optimum import for lightweight commands
- Lazy CLI command discovery: only import the invoked subcommand
- Replace the OS.get() WMI call with platform.version() for Win11 detection
- Merge all hardware queries (CIM + PnP + properties) into a single PowerShell process via query_all_hardware()
- Parallelize the PowerShell hardware probe with Python-side work (torch import, library version scanning) using ThreadPoolExecutor
- Move the QDQ_OP_TYPES import to function level to break the onnx ↔ compiler circular import exposed by lazy loading
- Cache __getattr__ results via globals() to avoid repeated resolution

Benchmarks:
- Intel x64: 11s → 2.7s
- Qualcomm ARM64: 64s → 8.2s
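For reviewers unfamiliar with the pattern, the lazy-loading and caching items above can be sketched with a module-level __getattr__ (PEP 562). This is a minimal illustration, not the code in this PR: the attribute map, the make_lazy_module helper, and the stdlib targets (json, math) are all hypothetical stand-ins for the heavy torch/transformers imports.

```python
import importlib
import types

# Hypothetical map: public attribute name -> (module to import, attribute in it).
# In a real package these would point at heavy submodules.
LAZY_ATTRS = {
    "dumps": ("json", "dumps"),
    "sqrt": ("math", "sqrt"),
}


def make_lazy_module(name, lazy_attrs):
    """Build a module whose attributes are imported on first access."""
    module = types.ModuleType(name)
    resolved = []  # demonstration only: records each real resolution

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            target_module, target_attr = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target_module), target_attr)
        # Cache in the module namespace so later lookups never reach
        # __getattr__ again. Inside a real __init__.py this would be
        # globals()[attr] = value.
        module.__dict__[attr] = value
        resolved.append(attr)
        return value

    module.__getattr__ = __getattr__
    module._resolved = resolved
    return module


if __name__ == "__main__":
    pkg = make_lazy_module("demo_pkg", LAZY_ATTRS)
    print(pkg.sqrt(9.0))   # first access imports math and caches sqrt
    print(pkg.sqrt(16.0))  # second access hits the cached attribute
    print(pkg._resolved)   # sqrt was resolved exactly once
```

Nothing is imported until an attribute is first touched, which is why a lightweight command that never touches the heavy names avoids their import cost.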
Summary
Resolves #261: winml sys took ~55s on Snapdragon X Plus with no progress indicator, appearing hung.

Root causes identified and fixed:
- winml.modelkit.__init__.py and cli.py eagerly imported torch/transformers/optimum (~6s) even for lightweight commands like sys
- PnpDevice.__init__ spawned a separate PowerShell for each NPU's Get-PnpDeviceProperty, which is extremely slow on Qualcomm ACPI devices

Optimizations applied:
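- Lazy __init__.py for the winml.modelkit and session packages, so the sys command skips the torch/transformers/optimum import
- Lazy CLI command discovery (_LazyGroup): only the invoked subcommand is imported
- platform.version() for Windows 11 detection: removes the OS.get() PowerShell call entirely
- query_all_hardware(): all hardware queries run in a single PowerShell process
- ThreadPoolExecutor parallelism: the PowerShell probe overlaps with import torch / library scanning
- Get-PnpDeviceProperty -KeyName: targeted property fetch for NPUs

Benchmarks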
Commands measured: winml sys, winml sys --list-device

Changed files
- __init__.py (winml.modelkit): __getattr__ lazy loading with cached resolution
- cli.py: _LazyGroup for on-demand command module import
- commands/sys.py: _is_windows_11(), query_all_hardware(), parallel execution
- session/__init__.py: __getattr__ lazy loading to break circular import chain
- sysinfo/helper.py: CimInstance.get_many_by_class_name(), PnpDevice batched properties, query_all_hardware()
- sysinfo/hardware.py: NPU._EXTRA_PROPERTY_KEYS for targeted property fetch
- onnx/detection.py: move QDQ_OP_TYPES import to function level (break circular import)
- tests/unit/sysinfo/test_sysinfo.py: updated for _is_windows_11(), added edge case coverage
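As a rough illustration of the single-process query plus parallel execution described for commands/sys.py and sysinfo/helper.py, the sketch below batches several CIM queries into one PowerShell script and overlaps that probe with Python-side work. It is an assumption-laden sketch rather than the PR's implementation: build_hardware_script, run_powershell, and gather are hypothetical names, and the real query_all_hardware() also covers PnP devices and properties, which are omitted here.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_hardware_script(cim_classes):
    """Combine several CIM queries into one script that emits one JSON object."""
    entries = "; ".join(
        f"'{cls}' = @(Get-CimInstance -ClassName {cls})" for cls in cim_classes
    )
    return f"@{{ {entries} }} | ConvertTo-Json -Depth 3 -Compress"


def run_powershell(script):
    """Run the script in a single PowerShell process (Windows only)."""
    completed = subprocess.run(
        ["powershell", "-NoProfile", "-NonInteractive", "-Command", script],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


def gather(probe, local_work):
    """Start the slow probe on a worker thread, then do local work meanwhile."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(probe)   # e.g. the PowerShell hardware probe
        local = local_work()          # e.g. heavy imports, version scanning
        hardware = future.result()    # block only for whatever time remains
    return hardware, local


if __name__ == "__main__":
    script = build_hardware_script(["Win32_Processor", "Win32_VideoController"])
    print(script)
    # On Windows, the probe could be wired up like this:
    # hardware, _ = gather(lambda: run_powershell(script), lambda: None)
```

Because the probe spends its time waiting on a child process, a single worker thread is enough for the two phases to overlap; total wall time approaches the slower of the two rather than their sum.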