Add ProcMon DLL diagnostic for Windows ROCm Stable Diffusion CI #1633
Draft
Copilot wants to merge 3 commits into superm1/multiple-rocm-backends from [...]
Agent-Logs-Url: https://github.com/lemonade-sdk/lemonade/sessions/0d7c9ebd-fa04-409d-b43a-83acfd7b6bb1 Co-authored-by: Geramy <264964+Geramy@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add ProcMon-based diagnostic step to Windows ROCm test workflow" to "Add ProcMon DLL diagnostic for Windows ROCm Stable Diffusion CI" on Apr 14, 2026
Increased timeout for CSV file readiness check and added a function to test file readability.
sd-cli.exe crashes with exit code -1073741515 (missing DLL/runtime) during the ESRGAN upscale test on the stx-halo Windows ROCm runner. This adds a non-intrusive ProcMon capture around the test run to surface exactly which libraries are absent.

New scripts (scripts/ci/windows/)
- procmon-start.ps1: downloads Sysinternals ProcMon, accepts the EULA via registry, starts a background capture to a PML backing file, and exports PROCMON_EXE / PROCMON_BACKING_FILE to GITHUB_ENV
- procmon-stop-and-report.ps1: terminates ProcMon (flushing the PML), converts PML→CSV via /OpenLog /SaveAs, filters for NAME NOT FOUND on .dll/.exe/.sys paths, writes missing-dlls-report.txt (grouped summary + full event list), and prints the report inline in the workflow log

Workflow changes
Three steps added to test-exe-inference, scoped exclusively to matrix.name == 'stable-diffusion (stx-halo)':

Artifact
procmon-missing-dlls-stable-diffusion (stx-halo) is uploaded even on failure and contains:
- procmon-capture.csv: full event log for deep inspection
- missing-dlls-report.txt: pre-filtered NAME NOT FOUND entries with process name, PID, and path; primary candidates for a missing redistributable

Original prompt
Create a PR against lemonade-sdk/lemonade based on branch superm1/multiple-rocm-backends to add a ProcMon-based diagnostic step to the Windows ROCm Stable Diffusion test workflow.

Context:
- [...] sd-cli.exe -M upscale ... and it crashes with exit code -1073741515 (likely missing DLL/runtime).

Requirements:
In .github/workflows/cpp_server_build_test_release.yml, add steps prior to the inference tests (specifically before running test/server_sd.py on the Windows ROCm matrix entry / stx-halo runner) to:
- [...] sd-cli.exe invocation to identify missing libraries/DLLs.
- [...] NAME NOT FOUND, focusing on DLL load operations.
- [...] procmon-missing-dlls-${{ matrix.name }}) even on failure.

The ProcMon report must include:
- [...] NAME NOT FOUND).
- [...] sd-cli.exe; include lemonade-server.exe if applicable).

Keep the diagnostic steps scoped so they run only where needed:

Ensure ProcMon runs non-interactively (accept EULA), starts capture, stops capture, and terminates cleanly.

Add any necessary PowerShell scripts to scripts/ci/windows/ (or similar) to keep YAML readable. Ensure the workflow calls them.

Update documentation/comments in workflow to explain how to interpret the artifact.
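The NAME NOT FOUND filtering described in these requirements can be prototyped offline against an exported capture. The sketch below is written in Python rather than the PowerShell the PR uses. It assumes the CSV carries ProcMon's default column headers (Process Name, PID, Operation, Path, Result). The function names and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter

# Extensions treated as loadable binaries (taken from the PR description).
BINARY_SUFFIXES = (".dll", ".exe", ".sys")


def find_missing_binaries(csv_text, process_names=("sd-cli.exe",)):
    """Return ProcMon rows whose Result is NAME NOT FOUND for a binary path.

    Assumes ProcMon's default CSV headers; adjust the keys if the export
    uses a different column set.
    """
    wanted = {name.lower() for name in process_names}
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("Result", "").strip().upper() != "NAME NOT FOUND":
            continue
        if row.get("Process Name", "").lower() not in wanted:
            continue
        if not row.get("Path", "").lower().endswith(BINARY_SUFFIXES):
            continue
        hits.append(row)
    return hits


def summarize(hits):
    """Group hits by file name so repeated probes of one DLL collapse."""
    return Counter(row["Path"].rsplit("\\", 1)[-1].lower() for row in hits)


# Hypothetical sample rows, not taken from a real capture.
SAMPLE = (
    '"Time of Day","Process Name","PID","Operation","Path","Result","Detail"\n'
    '"10:00:00","sd-cli.exe","4242","CreateFile","C:\\bin\\example.dll","NAME NOT FOUND",""\n'
    '"10:00:01","sd-cli.exe","4242","CreateFile","C:\\Windows\\System32\\example.dll","NAME NOT FOUND",""\n'
    '"10:00:02","sd-cli.exe","4242","CreateFile","C:\\bin\\model.bin","NAME NOT FOUND",""\n'
    '"10:00:03","other.exe","7","CreateFile","C:\\bin\\other.dll","NAME NOT FOUND",""\n'
    '"10:00:04","sd-cli.exe","4242","CreateFile","C:\\bin\\present.dll","SUCCESS",""\n'
)

if __name__ == "__main__":
    for name, count in summarize(find_missing_binaries(SAMPLE)).most_common():
        print(f"{name}: {count} failed lookups")
```

A name that fails in one directory but succeeds in another is ordinary DLL search-order probing, so the grouped counts are candidates to investigate, not a verdict. When reading a real export from disk, opening it with encoding "utf-8-sig" guards against a leading byte-order mark, if one is present.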
Deliverables:
Notes:
- [...] /AcceptEula, /Quiet, /BackingFile, /Minimized, and Procmon.exe /OpenLog <pml> /SaveAs <csv> for conversion.

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing in the Stable Diffusion (ROCm) inference test at:
- test/server_sd.py::StableDiffusionTests::test_020_upscale_basic (around line 650)
- /api/v1/images/upscale returns 500 with {"error":{"message":"ESRGAN upscale failed","type":"server_error"}}
- [...] sd-cli.exe -M upscale ... and it exits with code -1073741515 (Windows crash). That exit code is commonly a missing runtime/DLL or incompatible binary for the runner environment.

Root cause (most likely)
On the stx-halo Windows runner with ROCm backend, the ESRGAN upscale path invokes a native CLI (.\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe) which crashes immediately. This is not a Python test issue: the test correctly expects HTTP 200, but the server can’t complete the upscale because the underlying process dies.

Practical solution: don’t run ESRGAN upscale on the ROCm backend in CI
Given the crash is in the native ROCm preview binary, the fastest, most stable CI fix is to skip the ESRGAN “basic upscale” test when running with --backend rocm (or treat it as xfail) until the ROCm sd-cli upscale mode is fixed upstream.

Code change (recommended): conditionally skip in test/server_sd.py
Add a backend check and skip test_020_upscale_basic for ROCm. For example:

If ServerTestBase already stores the backend differently, use that; the key point is to detect the backend used by run_server_tests(..., --backend rocm) and skip only then.

Complementary workflow-side improvement (optional)
In .github/workflows/cpp_server_build_test_release.yml, the failing matrix entry is:

This pull request was created from Copilot chat.
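The exit code quoted in the chat, -1073741515, can be decoded by reinterpreting it as an unsigned 32-bit NTSTATUS value. A minimal Python sketch follows. The helper name is hypothetical, and the lookup table is deliberately tiny, holding only two well-known loader-related codes.

```python
# Two well-known NTSTATUS values; this is not an exhaustive table.
KNOWN_NTSTATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}


def decode_exit_code(exit_code):
    """Map a signed 32-bit process exit code to (hex string, NTSTATUS name)."""
    unsigned = exit_code & 0xFFFFFFFF  # two's-complement reinterpretation
    return f"0x{unsigned:08X}", KNOWN_NTSTATUS.get(unsigned, "unknown")


if __name__ == "__main__":
    print(decode_exit_code(-1073741515))
```

For -1073741515 this yields 0xC0000135, STATUS_DLL_NOT_FOUND, which is consistent with the missing-DLL hypothesis that the ProcMon capture is meant to confirm.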