Add diagnostic logging to chase sd-cli.exe ESRGAN upscale crash on Windows ROCm #1635
Agent-Logs-Url: https://github.com/lemonade-sdk/lemonade/sessions/147c64c4-8c70-44b2-8db6-22525eae1261 Co-authored-by: Geramy <264964+Geramy@users.noreply.github.com>
@claude do this to this branch
Claude finished @Geramy's task in 5m 55s
Adding missing-DLL diagnostic to sd-cli crash path
The implementation is committed locally and ready to push. A push approval is needed (the
What was implemented
New function
Call site in
Expected output when the crash recurs
@claude please push
Claude finished @Geramy's task in 2m 28s
Missing-DLL diagnostic implemented - push needs approval
The
What was committed
New function
Call site in
Expected output when the crash recurs
Please approve the
@copilot please take this over and implement this. Just push to this branch, don't ask.
…m upscale crash Agent-Logs-Url: https://github.com/lemonade-sdk/lemonade/sessions/f4f8efd0-0a4a-4b56-86f4-f527d62dcc7c Co-authored-by: Geramy <264964+Geramy@users.noreply.github.com>
Done in d255d16. Added
Triggered automatically when the upscale subprocess exits with
@copilot What was the output of the test?
The diagnostic logging ran on the CI runner (artifact
The PE import scanner produced no
The most likely explanation: the PATH passed to the subprocess contains a relative path (
The fix is to make the ROCm runtime directory path absolute when constructing the PATH for the subprocess. I can implement that if you'd like.
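A minimal sketch of the fix proposed in this comment, assuming the PATH string is assembled in C++ before the subprocess is started. `absolutize_path_list` and its parameters are hypothetical names used for illustration and are not part of the PR; the sketch relies only on `std::filesystem` from the standard library.

```cpp
#include <filesystem>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not from the PR): rewrite every relative entry of a
// PATH-style list as an absolute path anchored at `base`, so the result no
// longer depends on the working directory of the process that consumes it.
std::string absolutize_path_list(const std::string& path_list,
                                 const fs::path& base,
                                 char sep = ';') {
    std::string out;
    std::istringstream in(path_list);
    std::string entry;
    bool first = true;
    while (std::getline(in, entry, sep)) {
        // Assumption: empty segments (e.g. ";;") carry no meaning and are dropped.
        if (entry.empty()) continue;
        fs::path p(entry);
        if (p.is_relative()) {
            // Anchor at `base` and collapse "." / ".." components textually.
            p = (base / p).lexically_normal();
        }
        if (!first) out += sep;
        out += p.string();
        first = false;
    }
    return out;
}
```

On Windows the separator would be `;`, and `base` would typically be the directory the server itself uses to resolve relative paths, for example `absolutize_path_list(path_env, std::filesystem::current_path())`.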
`sd-cli.exe -M upscale` crashes with exit code `-1073741515` on Windows ROCm CI runners. This adds diagnostic-only logging across the upscale path, with no behavior, response, or status code changes.

Changes

`SDServer::upscale_via_cli`: before launching sd-cli, logs:

On failure, replaces the existing single-line warning with:
- `exit_code=-1073741515 (0xC0000135)`
- `STATUS_DLL_NOT_FOUND` (`0xC0000135`): automatically scans sd-cli.exe's PE import table and logs every unresolvable DLL by name

`log_missing_dll_imports_win32` (Windows-only, new helper in `sd_server.cpp`), triggered when the upscale subprocess exits with `0xC0000135`:
- `IMAGE_IMPORT_DESCRIPTOR` entry in the import table
- `DLL check MISSING: <name> (not in <dir>, err=<code>)` for any unresolvable dependency

`ProcessManager::get_exit_code` / `wait_for_exit` (Windows only): emits a `WARNING` with decimal + hex exit code whenever a process exits non-zero

`Server::handle_image_upscale`: when returning 500 for an empty upscale result, logs `resolved_backend`, cli path, and model path before the error response

CI Run Findings
The diagnostic logging ran on the ROCm CI runner and produced the following output:
The PE import scanner found no missing static imports: all directly-imported DLLs were resolvable. However, the PATH passed to the subprocess contains a relative path (`.\ci-cache\bin\sd-cpp\rocm-preview`). If the subprocess working directory differs from where the server resolves that relative path, the ROCm runtime DLLs would be invisible to the Windows loader, producing `STATUS_DLL_NOT_FOUND` even though the files exist on disk.

Original prompt
Add minimal diagnostic-only logging to help chase `sd-cli.exe` crash during ESRGAN upscale on Windows ROCm, stacking on top of PR #1633 head branch.

Base branch/ref to use (PR head): `copilot/superm1add-procmon-diagnostic-step` (SHA 3fd69d9).

Do NOT change behavior (no skipping tests, no changing timeouts, no altering API responses). Only add logging.

Where to add logs:

`src/cpp/server/backends/sd_server.cpp` in `lemon::backends::SDServer::upscale_via_cli`:
- Keep using `ProcessManager::start_process` (don’t switch execution method), but you can adjust its parameters only if it is guaranteed not to change behavior; prefer no parameter changes.

`src/cpp/server/utils/process_manager.cpp`:

(Optional) `src/cpp/server/server.cpp` in `handle_image_upscale`:

Acceptance:
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing in the Windows ROCm Stable Diffusion matrix entry (`stable-diffusion (stx-halo)`) during `test_020_upscale_basic` in `test/server_sd.py`.

What’s actually failing

From the logs:
- `POST /api/v1/images/upscale` and expects 200.
- `{"error":{"message":"ESRGAN upscale failed","type":"server_error"}}`

Exit code -1073741515 is a Windows crash (commonly missing DLL / dependency failure). So this is not a model download issue (download completed); it’s the ROCm sd-cli.exe upscale path crashing on that runner.
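The decimal exit code quoted above and the NTSTATUS constant `0xC0000135` (`STATUS_DLL_NOT_FOUND`) are the same 32-bit value, read as signed versus unsigned. The sketch below shows that conversion and produces a log string in the `exit_code=-1073741515 (0xC0000135)` form mentioned in the PR description. `format_exit_code` and `is_dll_not_found` are hypothetical helper names for illustration, not the functions the PR actually adds.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// NTSTATUS value for STATUS_DLL_NOT_FOUND.
constexpr std::uint32_t kStatusDllNotFound = 0xC0000135u;

// Hypothetical helper: render an exit code as signed decimal plus the
// unsigned 32-bit hex view that matches NTSTATUS tables.
std::string format_exit_code(std::int32_t exit_code) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "exit_code=%d (0x%08X)",
                  static_cast<int>(exit_code),
                  static_cast<unsigned int>(static_cast<std::uint32_t>(exit_code)));
    return std::string(buf);
}

// Hypothetical helper: true when the exit code is STATUS_DLL_NOT_FOUND,
// regardless of whether the caller holds it as a signed value.
bool is_dll_not_found(std::int32_t exit_code) {
    return static_cast<std::uint32_t>(exit_code) == kStatusDllNotFound;
}
```

Comparing on the unsigned view avoids sign mistakes when the exit code arrives as a signed `int` from a process API.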
Best fix (make the ROCm SD test reliable): skip ESRGAN upscale on ROCm backend
Given this is a backend-specific crash, the most stable immediate solution is to skip the ESRGAN upscale test when running with `--backend rocm` (or more generally: only require upscale on backends where it’s supported and stable).

Code change suggestion (in `test/server_sd.py`)

Update `test_020_upscale_basic` to skip when backend is ROCm. This test file already accepts `--backend ...` via the workflow (see the `Run tests` step in `.github/workflows/cpp_server_build_test_release.yml`).

Example patch (add near the start of `test_020_upscale_basic`):

To make `self.backend` available, the simplest non-invasive approach is to have your existing test harness (`utils/server_base.py` / `run_server_tests`) store the parsed `--backend` onto the test class or instance. If that’s not already done, you can alternatively rely only on an env var and set it in the workflow step when calling the test.

Workflow-level alternative (skip just this assertion on ROCm runs)
If you don’t want to modify the shared harness, you can pass an env var from the ROCm SD matrix entry.
In `.github/workflows/cpp_server_build_test_release.yml`, in the `Run tests` step, add:

…but note `matrix.backends` is `"rocm"` for this entry, and could be `"vulkan rocm"` in others. For correctness you’d want to set it per-iteration (inside the PowerShell loo...

This pull request was created from Copilot chat.