
Add ProcMon DLL diagnostics for Stable Diffusion ROCm CI failures #1632

Closed
Copilot wants to merge 2 commits into main from copilot/update-ci-add-procmon-diagnostics

Contributor

Copilot AI commented Apr 14, 2026

sd-cli.exe exits with 0xC0000135 (STATUS_DLL_NOT_FOUND) on the stx-halo ROCm runner, causing test_020_upscale_basic to return 500. The specific missing DLL(s) are unknown without loader-level tracing.

Changes

  • ci/procmon-dll-diagnose.ps1 — Downloads Sysinternals ProcMon, runs a short capture while invoking sd-cli.exe -M upscale --help, exports to CSV, then filters and prints all NAME NOT FOUND/PATH NOT FOUND entries for .dll files triggered by sd-cli.exe. Exits with sd-cli.exe's exit code so the step still fails when the binary is broken.

  • cpp_server_build_test_release.yml — Adds two steps to test-exe-inference, guarded to run only when matrix.script == 'server_sd.py' && contains(matrix.backends, 'rocm'):

    1. Run procmon-dll-diagnose.ps1 after the test step (if: always()) to capture diagnostics even on failure
    2. Upload sd-cli.pml + sd-cli.csv as artifact sd-cli-procmon-${{ matrix.name }} (if-no-files-found: warn — ProcMon availability should not gate CI result)

No other matrix entries are affected.
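
For reference, the CSV filtering step described above can be sketched in Python (the script in this PR is PowerShell). This is a minimal sketch that assumes ProcMon's default CSV columns (`Process Name`, `Path`, `Result`). The function name `missing_dll_paths` and the sample rows, including `example_runtime.dll`, are hypothetical and are not taken from the CI logs.

```python
import csv
import io

# ProcMon result strings that indicate a failed file lookup.
MISSING_RESULTS = {"NAME NOT FOUND", "PATH NOT FOUND"}


def missing_dll_paths(csv_text, process_name="sd-cli.exe"):
    """Return unique .dll paths that process_name failed to find, in first-seen order."""
    seen = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get("Process Name") or "").lower() != process_name.lower():
            continue
        if (row.get("Result") or "").strip().upper() not in MISSING_RESULTS:
            continue
        path = (row.get("Path") or "").strip()
        if path.lower().endswith(".dll") and path not in seen:
            seen.append(path)
    return seen


# Hypothetical sample export; the paths and DLL names are illustrative only.
SAMPLE = "\n".join([
    r'"Time of Day","Process Name","PID","Operation","Path","Result","Detail"',
    r'"10:00:00.0000001","sd-cli.exe","1234","CreateFile","C:\ci-cache\example_runtime.dll","NAME NOT FOUND",""',
    r'"10:00:00.0000002","sd-cli.exe","1234","CreateFile","C:\Windows\System32\kernel32.dll","SUCCESS",""',
    r'"10:00:00.0000003","python.exe","4321","CreateFile","C:\ci-cache\other.dll","NAME NOT FOUND",""',
    r'"10:00:00.0000004","sd-cli.exe","1234","CreateFile","C:\Windows\System32\example_runtime.dll","PATH NOT FOUND",""',
    r'"10:00:00.0000005","sd-cli.exe","1234","QueryOpen","C:\ci-cache\config.json","NAME NOT FOUND",""',
])

for dll in missing_dll_paths(SAMPLE):
    print(dll)
```

The loader probes several directories for each DLL, so a name that appears only in failed lookups is a stronger signal than one that fails in one directory and succeeds in another. When reading a real export from disk, opening it with `encoding="utf-8-sig"` avoids a stray byte-order mark in the first column name.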

Original prompt

Update PR #1460 CI to add ProcMon-based diagnostics for missing DLLs when running Stable Diffusion tests on Windows (especially ROCm backend).

Context:

Requested change:

  1. Add a PowerShell script to repo (suggested path: ci/procmon-dll-diagnose.ps1) that:

    • Downloads Sysinternals Process Monitor (ProcMon)
    • Starts a capture to a backing file in RUNNER_TEMP
    • Runs the ROCm sd-cli.exe once (use the path seen in logs: .\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe) with a harmless command (e.g., "--help" or "-M upscale --help")
    • Stops ProcMon capture
    • Exports the PML log to CSV
    • Prints to stdout a filtered list of likely missing DLL load failures (lines containing sd-cli.exe, .dll, and NAME NOT FOUND or PATH NOT FOUND)
    • Exits with the same exit code as sd-cli.exe so the job still fails if it’s broken.
  2. Wire this script into the workflow .github/workflows/cpp_server_build_test_release.yml in the test-exe-inference job so it runs ONLY for Stable Diffusion test matrix entries (matrix.script == server_sd.py). Preferably guard so it only runs when matrix.backends contains rocm as well.

  3. Upload diagnostics artifacts even on failure:

    • Upload ${{ runner.temp }}\sd-cli.pml and ${{ runner.temp }}\sd-cli.csv (if they exist)
    • Artifact name should include matrix.name (e.g., sd-cli-procmon-${{ matrix.name }})
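
The "exit with the same exit code" requirement above can be illustrated with a short Python sketch (the requested script is PowerShell). The helper name `run_and_return_code` is hypothetical, and the child process here is a stand-in Python one-liner rather than `sd-cli.exe`.

```python
import subprocess
import sys


def run_and_return_code(cmd):
    """Run cmd without raising on failure and return its exit code unchanged."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for the real binary: a child process that exits with code 3.
    code = run_and_return_code([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(f"child exit code: {code}")
    # Diagnostics would be collected and printed here, before exiting.
    sys.exit(code)  # propagate the child's exit code so the CI step still fails
```

Capturing the code first and exiting last is what lets the diagnostics run even when the binary fails to start.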

Acceptance:

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing in the Stable Diffusion inference tests on the ROCm backend (matrix.name: stable-diffusion (stx-halo), --backend rocm). The Python test failure is:

  • test/server_sd.py::StableDiffusionTests.test_020_upscale_basic (line ~650 on ref c74b23aee2b4d92f4e01d223a438601d419fe3ca)
  • Expected 200, got 500 from POST /api/v1/images/upscale
  • Server returned: {"error":{"message":"ESRGAN upscale failed","type":"server_error"}}

The server log shows the root cause:

  • Lemonade downloads RealESRGAN-x4plus successfully
  • Then it starts sd-cli.exe -M upscale ...
  • That subprocess crashes immediately with exit code -1073741515
    • On Windows, -1073741515 is the signed 32-bit form of 0xC0000135 (STATUS_DLL_NOT_FOUND), which means the loader could not find a DLL the process depends on (a missing runtime DLL).
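
The relationship between the signed exit code in the log and the NTSTATUS value can be checked with a few lines of Python. This is a small illustrative sketch; the helper name `to_ntstatus` is hypothetical.

```python
def to_ntstatus(exit_code):
    """Format a (possibly negative) 32-bit exit code as an unsigned hex NTSTATUS string."""
    # Masking with 0xFFFFFFFF reinterprets the signed value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741515))  # 0xC0000135
```

The mask also leaves already-unsigned values untouched, so the same helper works whether a tool reports the code as -1073741515 or 3221225781.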

So this is not a test bug; it’s an environment/runtime packaging issue for the ROCm “sd-cli.exe” upscale path (likely missing VC++ runtime or another dependent DLL when invoked for ESRGAN).

Fix (recommended): ship/copy required DLLs alongside sd-cli.exe in the CI cache/package

Ensure that whatever produces .\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe also places its dependent DLLs in the same directory (or otherwise makes them discoverable via PATH). Concretely:

  1. Identify missing DLLs by running sd-cli.exe in CI and capturing loader errors.
  2. Include those DLLs in the artifact/installer or copy them into the rocm-preview folder during setup.

Practical CI-side mitigation (fastest to unblock)

Add a Windows step before running the tests to install the VC++ runtime (common cause of 0xC0000135 on fresh runners/self-hosted images), then re-run.

In .github/workflows/cpp_server_build_test_release.yml under test-exe-inference job, add a step before “Run tests”:

      - name: Install Microsoft Visual C++ Runtime (required by sd-cli.exe)
        shell: powershell
        run: |
          $ErrorActionPreference = "Stop"

          # Install latest supported VC++ runtime (x64)
          $uri = "https://aka.ms/vs/17/release/vc_redist.x64.exe"
          $out = "$env:RUNNER_TEMP\vc_redist.x64.exe"
          Invoke-WebRequest -Uri $uri -OutFile $out -UseBasicParsing
          $p = Start-Process -FilePath $out -ArgumentList "/install","/quiet","/norestart" -Wait -PassThru
          if ($p.ExitCode -ne 0 -and $p.ExitCode -ne 3010) {
            throw "VC++ redist installer failed with exit code $($p.ExitCode)"
          }

Why this helps: sd-cli.exe is a native binary and commonly depends on the VC runtime DLLs; if they’re missing on the runner, Windows fails to load the executable with 0xC0000135, which matches your log sy...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Update CI to add ProcMon-based diagnostics for missing DLLs" to "Add ProcMon DLL diagnostics for Stable Diffusion ROCm CI failures" on Apr 14, 2026
Copilot AI requested a review from Geramy on April 14, 2026, 16:31
Member

Geramy commented Apr 14, 2026

@copilot this is wrong it should be merged into superm1/multiple-rocm-backends

Contributor Author

Copilot AI commented Apr 14, 2026

@copilot this is wrong it should be merged into superm1/multiple-rocm-backends

I can't force-push or open a new PR directly due to tool constraints. The correct commit is 98908fc — it contains only the two diagnostic file changes on top of origin/superm1/multiple-rocm-backends with no conflicts.

To create the right PR:

git fetch origin superm1/multiple-rocm-backends
git checkout -b procmon-sd-diagnostics origin/superm1/multiple-rocm-backends
git cherry-pick 98908fc
git push origin procmon-sd-diagnostics

Then open a PR targeting superm1/multiple-rocm-backends.
