Add ProcMon DLL diagnostics for Stable Diffusion ROCm CI failures #1632
Closed
Agent-Logs-Url: https://github.com/lemonade-sdk/lemonade/sessions/04a006e7-b2cc-4146-a2d2-45b75658f8f7 Co-authored-by: Geramy <264964+Geramy@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update CI to add ProcMon-based diagnostics for missing DLLs" to "Add ProcMon DLL diagnostics for Stable Diffusion ROCm CI failures" on Apr 14, 2026
Member
@copilot this is wrong it should be merged into superm1/multiple-rocm-backends
Contributor
Author
I can't force-push or open a new PR directly due to tool constraints. The correct commit is ... To create the right PR:

    git fetch origin superm1/multiple-rocm-backends
    git checkout -b procmon-sd-diagnostics origin/superm1/multiple-rocm-backends
    git cherry-pick 98908fc
    git push origin procmon-sd-diagnostics

Then open a PR targeting ...
Copilot stopped work on behalf of Geramy due to an error (April 14, 2026 16:50)
sd-cli.exe exits with 0xC0000135 (STATUS_DLL_NOT_FOUND) on the stx-halo ROCm runner, causing test_020_upscale_basic to return 500. The specific missing DLL(s) are unknown without loader-level tracing.

Changes

- ci/procmon-dll-diagnose.ps1: Downloads Sysinternals ProcMon, runs a short capture while invoking sd-cli.exe -M upscale --help, exports to CSV, then filters and prints all NAME NOT FOUND / PATH NOT FOUND entries for .dll files triggered by sd-cli.exe. Exits with sd-cli.exe's exit code so the step still fails when the binary is broken.
- cpp_server_build_test_release.yml: Adds two steps to test-exe-inference, guarded to run only when matrix.script == 'server_sd.py' && contains(matrix.backends, 'rocm'):
  - Runs procmon-dll-diagnose.ps1 after the test step (if: always()) to capture diagnostics even on failure
  - Uploads sd-cli.pml + sd-cli.csv as artifact sd-cli-procmon-${{ matrix.name }} (if-no-files-found: warn, since ProcMon availability should not gate the CI result)

No other matrix entries are affected.
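The filtering step described above can be sketched in Python for local analysis of a downloaded sd-cli.csv artifact. This is a minimal illustration, not the PowerShell script from this PR: the function name missing_dlls and the sample rows are hypothetical, and the column names ("Process Name", "Path", "Result") are assumed to match ProcMon's default CSV export.

```python
import csv
import io

# Result values that indicate a failed file lookup in a ProcMon capture.
MISSING_RESULTS = {"NAME NOT FOUND", "PATH NOT FOUND"}


def missing_dlls(csv_text, process_name="sd-cli.exe"):
    """Return sorted unique .dll paths that `process_name` failed to open.

    Assumes ProcMon's default CSV columns; adjust the keys if the
    capture was exported with a different column set.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = set()
    for row in reader:
        if row.get("Process Name", "").lower() != process_name.lower():
            continue  # event belongs to another process
        if row.get("Result", "").upper() not in MISSING_RESULTS:
            continue  # lookup succeeded or failed for another reason
        path = row.get("Path", "")
        if path.lower().endswith(".dll"):
            hits.add(path)
    return sorted(hits)


# Hypothetical sample rows, for illustration only (not real CI output).
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0000001","sd-cli.exe","4242","CreateFile","C:\ci-cache\bin\example_missing.dll","NAME NOT FOUND",""
"10:00:00.0000002","sd-cli.exe","4242","CreateFile","C:\Windows\System32\kernel32.dll","SUCCESS",""
"10:00:00.0000003","other.exe","5151","CreateFile","C:\ci-cache\bin\other_missing.dll","NAME NOT FOUND",""
"10:00:00.0000004","sd-cli.exe","4242","CreateFile","C:\ci-cache\bin\config.json","PATH NOT FOUND",""
'''

if __name__ == "__main__":
    for dll in missing_dlls(SAMPLE):
        print(dll)
```

When reading a real export from disk, opening the file with encoding="utf-8-sig" is a reasonable precaution in case the CSV starts with a byte-order mark. Note also that the Windows loader probes several directories in turn, so a NAME NOT FOUND entry for a DLL that is later found elsewhere is normal; the interesting entries are DLLs that never resolve in any location.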
Original prompt
Update PR #1460 CI to add ProcMon-based diagnostics for missing DLLs when running Stable Diffusion tests on Windows (especially ROCm backend).
Context:
Requested change:
Add a PowerShell script to repo (suggested path: ci/procmon-dll-diagnose.ps1) that:
Wire this script into the workflow .github/workflows/cpp_server_build_test_release.yml in the test-exe-inference job so it runs ONLY for Stable Diffusion test matrix entries (matrix.script == server_sd.py). Preferably guard so it only runs when matrix.backends contains rocm as well.
Upload diagnostics artifacts even on failure:
Acceptance:
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing in the Stable Diffusion inference tests on the ROCm backend (matrix.name: stable-diffusion (stx-halo), --backend rocm). The Python test failure is:

- test/server_sd.py::StableDiffusionTests.test_020_upscale_basic (line ~650 on ref c74b23aee2b4d92f4e01d223a438601d419fe3ca)
- POST /api/v1/images/upscale
- {"error":{"message":"ESRGAN upscale failed","type":"server_error"}}

The server log shows the root cause:

- RealESRGAN-x4plus ... successfully
- sd-cli.exe -M upscale ...
- exit code -1073741515
- -1073741515 == 0xC0000135, which almost always means a required DLL dependency was not found (missing runtime DLL).

So this is not a test bug; it’s an environment/runtime packaging issue for the ROCm “sd-cli.exe” upscale path (likely missing VC++ runtime or another dependent DLL when invoked for ESRGAN).
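The equivalence between -1073741515 and 0xC0000135 follows from reading the same 32 bits as signed versus unsigned. A minimal sketch of that conversion, with ntstatus_hex as a hypothetical helper name:

```python
def ntstatus_hex(exit_code: int) -> str:
    """Format a (possibly negative) 32-bit process exit code as unsigned hex."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement value
    # as an unsigned 32-bit integer.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(ntstatus_hex(-1073741515))  # prints 0xC0000135
```

This is handy when a test harness or server log reports the signed form and the Windows documentation lists the status code in hex.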
Fix (recommended): ship/copy required DLLs alongside sd-cli.exe in the CI cache/package

Ensure that whatever produces .\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe also places its dependent DLLs in the same directory (or otherwise makes them discoverable via PATH). Concretely:

- ... sd-cli.exe in CI and capturing loader errors.
- ... rocm-preview folder during setup.

Practical CI-side mitigation (fastest to unblock)
Add a Windows step before running the tests to install the VC++ runtime (a common cause of 0xC0000135 on fresh runners/self-hosted images), then re-run.

In .github/workflows/cpp_server_build_test_release.yml, under the test-exe-inference job, add a step before “Run tests”:

Why this helps: sd-cli.exe is a native binary and commonly depends on the VC runtime DLLs; if they’re missing on the runner, Windows fails to load the executable with 0xC0000135, which matches your log sy...

This pull request was created from Copilot chat.