fix: DML/GPU build crash — skip compile config for EPs with enable_ep_context=False#405

Merged
DingmaomaoBJTU merged 7 commits into
mainfrom
qiowu/fix-dml-compile
Apr 29, 2026

@DingmaomaoBJTU DingmaomaoBJTU commented Apr 27, 2026

Collaborator

Summary

Fixes #396

winml build crashes with FileNotFoundError when targeting GPU/DML because the compile stage produces no EPContext file, but the pipeline unconditionally advances current_path to the non-existent compiled_path.

Root Cause

Why DML hits this but CPU doesn't:

CPU is protected at config generation time: _DEVICE_TO_PROVIDER["cpu"] = None → WinMLCompileConfig.for_provider(None) returns None → config.compile = None → compile stage is skipped entirely by the first guard in _run_compile_stage.

DML goes through a different path: _DEVICE_TO_PROVIDER["gpu"] = "dml" → WinMLCompileConfig.for_dml() returns a config with enable_ep_context=False. This produced a non-None compile config, so the stage was entered. But enable_ep_context=False means no EPContext file is ever written. The old code then unconditionally set current_path = compiled_path, pointing it at a file that doesn't exist.
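The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the project's code: fake_compile and old_compile_stage are hypothetical stand-ins, and the only behaviour taken from the description is that the stage reports success without writing a file when enable_ep_context is False, and that the old pipeline advanced the path regardless.

```python
import tempfile
from pathlib import Path


def fake_compile(compiled_path: Path, enable_ep_context: bool) -> bool:
    """Hypothetical compile step: writes output only for EPContext EPs."""
    if enable_ep_context:
        compiled_path.write_bytes(b"epcontext")
    return True  # reports success either way


def old_compile_stage(
    current_path: Path, compiled_path: Path, enable_ep_context: bool
) -> Path:
    """Pre-fix behaviour: always advance to compiled_path."""
    fake_compile(compiled_path, enable_ep_context)
    return compiled_path  # bug: never checks that the file was written


with tempfile.TemporaryDirectory() as tmp:
    quantized = Path(tmp) / "quantized.onnx"
    quantized.write_bytes(b"model")
    compiled = Path(tmp) / "compiled.onnx"

    # DML-like case: compile "succeeds" but writes nothing.
    next_path = old_compile_stage(quantized, compiled, enable_ep_context=False)
    try:
        next_path.read_bytes()  # the downstream stage tries to load the model
        crashed = False
    except FileNotFoundError:
        crashed = True
```

The crash surfaces in whichever later stage first opens the path, which is why the traceback points away from the compile stage itself.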

Fix

Three changes:

1. Fix at config level — WinMLCompileConfig.for_provider() returns None for non-EPContext EPs

Rather than checking enable_ep_context inside build.py, the fix teaches for_provider() that EPs with enable_ep_context=False have no offline compile step and should return None:

factory = factories.get(provider)
if factory:
    config = factory()
    # EPs that don't produce EPContext have no offline compile step
    if not config.ep_config.enable_ep_context:
        return None
    return config

This naturally covers DML, CPU, CUDA, NvTensorRTRTX, VitisAI, and MIGraphX without hardcoding EP names. The logic is derived from the enable_ep_context flag already set in each factory. Unknown/custom EPs (generic fallback) are not subject to this rule.
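To show how the snippet above fits into a complete lookup, here is a self-contained sketch. The dataclasses, the FACTORIES table, and the shape of the generic fallback are assumptions made for illustration; only the enable_ep_context check mirrors the change in this PR. QNN is used as the EPContext-producing example because the PR names it as one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class EPConfig:  # hypothetical stand-in for the real EP config
    enable_ep_context: bool = False


@dataclass
class CompileConfig:  # hypothetical stand-in for WinMLCompileConfig
    provider: str
    ep_config: EPConfig = field(default_factory=EPConfig)


def for_dml() -> CompileConfig:
    return CompileConfig("dml", EPConfig(enable_ep_context=False))


def for_qnn() -> CompileConfig:
    return CompileConfig("qnn", EPConfig(enable_ep_context=True))


# Assumed factory table; the real one lists more EPs.
FACTORIES: Dict[str, Callable[[], CompileConfig]] = {
    "dml": for_dml,
    "qnn": for_qnn,
}


def for_provider(provider: Optional[str]) -> Optional[CompileConfig]:
    if provider is None:
        return None
    factory = FACTORIES.get(provider)
    if factory:
        config = factory()
        # EPs that don't produce EPContext have no offline compile step
        if not config.ep_config.enable_ep_context:
            return None
        return config
    # Generic fallback for unknown/custom EPs: not subject to the rule above.
    # The field values here are an assumption.
    return CompileConfig(provider, EPConfig(enable_ep_context=True))
```

Because the decision comes from the flag each factory already sets, adding a new EP factory needs no change to the caller: a None result means the compile stage is skipped.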

2. Raise explicitly when EPContext-producing EP reports success but output is missing

Replaces the silent compiled_path.exists() fallback with an explicit RuntimeError, so real failures (e.g. QNN) surface immediately:

if not compiled_path.exists():
    raise RuntimeError(
        f"Compile reported success but output not found: {compiled_path}"
    )
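Put together with the config-level guard, the stage might look roughly like the following. This is a sketch under assumptions: run_compile_stage, its parameters, and the two compile callables are hypothetical, and only the None guard and the RuntimeError message come from the PR.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_compile_stage(
    current_path: Path,
    compiled_path: Path,
    compile_config: Optional[object],
    compile_fn: Callable[[Path, Path], None],
) -> Path:
    """Return the path the next stage should read (hypothetical helper)."""
    # First guard: no compile config means the stage is skipped entirely.
    if compile_config is None:
        return current_path

    compile_fn(current_path, compiled_path)

    # An EPContext-producing EP must leave a file behind; fail loudly if not.
    if not compiled_path.exists():
        raise RuntimeError(
            f"Compile reported success but output not found: {compiled_path}"
        )
    return compiled_path


def writes_output(src: Path, dst: Path) -> None:
    dst.write_bytes(src.read_bytes())


def writes_nothing(src: Path, dst: Path) -> None:
    pass  # simulates a compile that "succeeds" without producing a file


with tempfile.TemporaryDirectory() as tmp:
    quantized = Path(tmp) / "quantized.onnx"
    quantized.write_bytes(b"model")
    compiled = Path(tmp) / "compiled.onnx"

    skipped = run_compile_stage(quantized, compiled, None, writes_output)
    try:
        run_compile_stage(quantized, compiled, object(), writes_nothing)
        error_message = None
    except RuntimeError as exc:
        error_message = str(exc)
    advanced = run_compile_stage(quantized, compiled, object(), writes_output)
```

Raising here trades a silent fallback to the previous stage's output for an error at the point where the expectation is violated.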

3. --no-compile defaults to True

Changed from is_flag=True, default=False to --no-compile/--compile pair with default=True, so compile is disabled by default in the CLI.
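The on/off semantics of the flag pair can be illustrated without the project's CLI. The is_flag parameter suggests the CLI is built on Click, but that is an inference; to keep this sketch dependency-free it uses argparse from the standard library instead, with two options writing to one destination. The parser name and help strings are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="build-sketch")
    # Both options share dest="no_compile"; the default (True) means
    # compilation is skipped unless --compile is passed.
    parser.add_argument(
        "--no-compile",
        dest="no_compile",
        action="store_true",
        default=True,
        help="skip the compile stage (default)",
    )
    parser.add_argument(
        "--compile",
        dest="no_compile",
        action="store_false",
        help="opt in to the compile stage",
    )
    return parser


parser = build_parser()
default_args = parser.parse_args([])
opt_in_args = parser.parse_args(["--compile"])
explicit_skip_args = parser.parse_args(["--no-compile"])
```

With this shape, omitting the flag and passing --no-compile are equivalent, and --compile is the only way to turn the stage on.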

…ists

When compile runs with enable_ep_context=false (DML), no EPContext file
is produced. The build pipeline unconditionally set current_path to the
non-existent compiled_path, causing FileNotFoundError downstream.

Now checks compiled_path.exists() before updating current_path, so the
pipeline falls through to the previous stage's output (e.g. quantized.onnx).

Fixes #396
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner April 27, 2026 14:13
@xieofxie

Contributor

but I do think DML also supports compile, bug?

@DingmaomaoBJTU

DingmaomaoBJTU commented Apr 28, 2026

Collaborator Author

Good catch - DML does support compile, but in our current flow it is JIT-style (usually no persisted EPContext file). See: https://github.com/microsoft/WinML-ModelKit/blob/77d553b75846178e94e0e1dca0d5d0ec623cfc7e/src/winml/modelkit/compiler/configs.py#L177. So the bug fixed here is the path assumption: _run_compile_stage switched to compiled_path even when no file was produced. Now it only switches when the file actually exists.

DML and CPU don't produce EPContext output, so running compile_onnx for
them is pure overhead. Skip the stage early when enable_ep_context=False
rather than running compile and silently falling back on missing output.

Also replace the compiled_path.exists() silent fallback with an explicit
RuntimeError for EPs that do expect EPContext output (e.g. QNN), so
silent failures are no longer swallowed.

--no-compile/--compile flag pair replaces the previous --no-compile
is_flag, with default=True (no-compile). Compilation is now opt-in:
users pass --compile to enable it, or keep the default to skip.

WinMLCompileConfig.for_provider() now checks enable_ep_context on the
factory result and returns None for EPs that don't produce EPContext
(dml, cpu, cuda, nv_tensorrt_rtx, vitisai, migraphx). This fixes the
DML build crash (#396) where compiled_path was set without checking the
file exists, by preventing DML from entering the compile stage at all.

Also changes --no-compile CLI default to True (compile disabled by
default) and adds a RuntimeError when compile reports success but the
output file is missing, replacing the previous silent fallback.
@DingmaomaoBJTU DingmaomaoBJTU changed the title from "fix: DML/GPU build fails with FileNotFoundError on compiled.onnx" to "fix: DML/GPU build crash — skip compile config for EPs with enable_ep_context=False" Apr 29, 2026
@DingmaomaoBJTU DingmaomaoBJTU merged commit 3a93a71 into main Apr 29, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the qiowu/fix-dml-compile branch April 29, 2026 07:56
Development

Successfully merging this pull request may close these issues.

Build fails for DML/GPU: compiled.onnx missing when enable_ep_context=false