
Add diagnostic logging to chase sd-cli.exe ESRGAN upscale crash on Windows ROCm #1635

Draft
Copilot wants to merge 3 commits into copilot/superm1add-procmon-diagnostic-step from copilot/add-diagnostic-logging-sd-cli-crash

Conversation

Contributor

Copilot AI commented Apr 14, 2026

sd-cli.exe -M upscale crashes with exit code -1073741515 on Windows ROCm CI runners. This adds diagnostic-only logging across the upscale path — no behavior, response, or status code changes.

Changes

  • SDServer::upscale_via_cli — before launching sd-cli, logs:

    • cli path, model path, input/output temp file paths, decoded input byte size
    • full copy/pasteable command line (exe + quoted args)
    • each env var passed to the subprocess (PATH / LD_LIBRARY_PATH)

    On failure, replaces the existing single-line warning with:

    • exit code in both decimal and hex (exit_code=-1073741515 (0xC0000135))
    • whether the output file was produced and its size
    • on STATUS_DLL_NOT_FOUND (0xC0000135): automatically scans sd-cli.exe's PE import table and logs every unresolvable DLL by name
  • log_missing_dll_imports_win32 (Windows-only, new helper in sd_server.cpp) — triggered when the upscale subprocess exits with 0xC0000135:

    • maps sd-cli.exe as a read-only PE image (no DLL resolution, no code executed)
    • walks every IMAGE_IMPORT_DESCRIPTOR entry in the import table
    • checks each imported DLL: first beside sd-cli.exe (bundled ROCm DLLs), then via system PATH / known-DLLs
    • logs DLL check MISSING: <name> (not in <dir>, err=<code>) for any unresolvable dependency
  • ProcessManager::get_exit_code / wait_for_exit (Windows only) — emits a WARNING with decimal + hex exit code whenever a process exits non-zero

  • Server::handle_image_upscale — when returning 500 for an empty upscale result, logs resolved_backend, cli path, and model path before the error response
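The two-step lookup described for log_missing_dll_imports_win32 (beside the executable first, then the PATH directories) can be sketched roughly as follows. This is only an illustration in Python, not the C++ helper from the PR: the function names `dir_has_file` and `find_missing_dlls` are hypothetical, and the sketch ignores KnownDLLs, the system directories, and API sets, all of which the real Windows loader also consults.

```python
import os


def dir_has_file(directory: str, name: str) -> bool:
    """Case-insensitive check that `name` exists directly inside `directory`."""
    try:
        entries = os.listdir(directory)
    except OSError:
        return False  # an unreadable or nonexistent directory counts as "not found"
    return name.lower() in {entry.lower() for entry in entries}


def find_missing_dlls(dll_names, exe_dir, path_value, sep=";"):
    """Return the names found neither beside the executable nor on PATH."""
    search_dirs = [d for d in path_value.split(sep) if d]
    missing = []
    for name in dll_names:
        if dir_has_file(exe_dir, name):
            continue  # step 1: bundled next to the executable
        if any(dir_has_file(d, name) for d in search_dirs):
            continue  # step 2: reachable through a PATH directory
        missing.append(name)
    return missing
```

Note that a check like this runs from the parent's point of view: relative PATH entries are resolved against the checker's working directory, not the subprocess's.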

CI Run Findings

The diagnostic logging ran on the ROCm CI runner and produced the following output:

[Info]  (SDServer) sd-cli upscale launch: cli=.\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe model=...RealESRGAN_x4plus.pth input=...\input_3005812486250900.png output=...\output_3005812486250900.png input_bytes=122337
[Info]  (SDServer) sd-cli cmdline: ".\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe" "-M" "upscale" ...
[Info]  (SDServer) sd-cli env: PATH=.\ci-cache\bin\sd-cpp\rocm-preview;...
[Warn]  (ProcessManager) Process exited with non-zero code: -1073741515 (0xc0000135)
[Warn]  (SDServer) ESRGAN upscale failed exit_code=-1073741515 (0xc0000135) out_exists=false out_size=0 ...
[Warn]  (SDServer) 0xC0000135 = STATUS_DLL_NOT_FOUND; scanning sd-cli imports for missing DLLs...
[Info]  (SDServer) DLL check: scanning imports of .\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe

The PE import scanner found no missing static imports — all directly-imported DLLs were resolvable. However, the PATH passed to the subprocess contains a relative path (.\ci-cache\bin\sd-cpp\rocm-preview). If the subprocess working directory differs from where the server resolves that relative path, the ROCm runtime DLLs would be invisible to the Windows loader, producing STATUS_DLL_NOT_FOUND even though the files exist on disk.

Original prompt

Add minimal diagnostic-only logging to help chase sd-cli.exe crash during ESRGAN upscale on Windows ROCm, stacking on top of PR #1633 head branch.

Base branch/ref to use (PR head): copilot/superm1add-procmon-diagnostic-step (SHA 3fd69d9).

Do NOT change behavior (no skipping tests, no changing timeouts, no altering API responses). Only add logging.

Where to add logs:

  1. src/cpp/server/backends/sd_server.cpp in lemon::backends::SDServer::upscale_via_cli:

    • Before launching sd-cli, log:
      • cli path, model path
      • input/output temp paths
      • decoded input image byte size
      • a copy/pasteable reconstructed command line (exe + args)
      • the env_vars passed to subprocess (these should be PATH/LD_LIBRARY_PATH; avoid logging unrelated sensitive env)
    • After process finishes, on failure log:
      • exit code in decimal and hex
      • whether output file exists, and if so its size

    Keep using ProcessManager::start_process (don’t switch execution method), but you can adjust its parameters only if it is guaranteed not to change behavior; prefer no parameter changes.

  2. src/cpp/server/utils/process_manager.cpp:

    • Add diagnostic logging helpers for Windows exit codes:
      • when a process ends (wait_for_exit / get_exit_code), if exit code is non-zero, log both decimal and hex
      • keep existing return values and control flow
  3. (Optional) src/cpp/server/server.cpp in handle_image_upscale:

    • When returning 500 due to upscale failure (upscaled empty), log backend and cli_exe and model path.
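As a rough illustration of the pre-launch logging requested in item 1, the sketch below rebuilds a copy/pasteable command line and filters the environment down to the two variables that are safe to log. It is written in Python for brevity; the actual change lives in C++, and the helper names `format_cmdline` and `loggable_env` are hypothetical. The quoting is deliberately naive and assumes no argument contains a double quote.

```python
def format_cmdline(exe, args):
    """Quote every token so the line can be pasted into a shell for a manual rerun."""
    return " ".join(f'"{token}"' for token in [exe, *args])


def loggable_env(env_vars, allowed=("PATH", "LD_LIBRARY_PATH")):
    """Keep only the variables that are relevant to DLL lookup and safe to log."""
    return {key: value for key, value in env_vars.items() if key in allowed}
```

Filtering by an allow-list rather than a deny-list means an unexpected secret in the subprocess environment is never written to the log by accident.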

Acceptance:

  • New logs must show up in the existing server log capture artifact for the failing job.
  • No changes to response bodies or HTTP status codes.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing in the Windows ROCm Stable Diffusion matrix entry (stable-diffusion (stx-halo)) during test_020_upscale_basic in test/server_sd.py.

What’s actually failing

From the logs:

  • The test calls POST /api/v1/images/upscale and expects 200.
  • The server returns 500 with: {"error":{"message":"ESRGAN upscale failed","type":"server_error"}}
  • Server-side log shows the root cause:
(SDServer) ESRGAN upscale failed (exit code: -1073741515, model: ...RealESRGAN_x4plus.pth, cli: ...sd-cli.exe)

Exit code -1073741515 is 0xC0000135 (STATUS_DLL_NOT_FOUND) when read as an unsigned 32-bit value: the Windows loader could not find a DLL that the process depends on. This is therefore not a model download issue (the download completed); the ROCm sd-cli.exe upscale path is failing on that runner.
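The decimal exit code is easier to look up once it is reinterpreted as an unsigned 32-bit value. A minimal Python sketch of that conversion follows; the function names are illustrative and are not taken from the PR.

```python
STATUS_DLL_NOT_FOUND = 0xC0000135


def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def is_dll_not_found(exit_code: int) -> bool:
    """True when the exit code is STATUS_DLL_NOT_FOUND, signed or unsigned."""
    return (exit_code & 0xFFFFFFFF) == STATUS_DLL_NOT_FOUND
```

Masking with 0xFFFFFFFF handles both representations, so the same check works whether the process API reports the code as signed or unsigned.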

Best fix (make the ROCm SD test reliable): skip ESRGAN upscale on ROCm backend

Given this is a backend-specific crash, the most stable immediate solution is to skip the ESRGAN upscale test when running with --backend rocm (or more generally: only require upscale on backends where it’s supported and stable).

Code change suggestion (in test/server_sd.py)

Update test_020_upscale_basic to skip when backend is ROCm. This test file already accepts --backend ... via the workflow (see the Run tests step in .github/workflows/cpp_server_build_test_release.yml).

Example patch (add near the start of test_020_upscale_basic):

import os
import unittest


# Method of the existing test class in test/server_sd.py.
def test_020_upscale_basic(self):
    """Test basic upscale: generate an image then upscale it."""

    # sd-cli.exe upscale currently crashes on Windows when tests run with
    # --backend rocm (observed in CI: exit code -1073741515). Skip so ROCm SD
    # coverage stays focused on generation.
    backend = getattr(self, "backend", None) or os.environ.get("LEMONADE_TEST_BACKEND")
    if backend and backend.lower() == "rocm":
        raise unittest.SkipTest("ESRGAN upscale is unstable on Windows ROCm in CI (sd-cli.exe crash).")

    # ... existing test body ...

To make self.backend available, the simplest non-invasive approach is to have your existing test harness (utils/server_base.py / run_server_tests) store the parsed --backend onto the test class or instance. If that’s not already done, you can alternatively rely only on an env var and set it in the workflow step when calling the test.

Workflow-level alternative (skip just this assertion on ROCm runs)

If you don’t want to modify the shared harness, you can pass an env var from the ROCm SD matrix entry.

In .github/workflows/cpp_server_build_test_release.yml, in the Run tests step, add:

env:
  HF_HOME: ${{ env.HF_HOME }}
  LEMONADE_TEST_BACKEND: ${{ matrix.backends }}

…but note matrix.backends is "rocm" for this entry, and could be "vulkan rocm" in others. For correctness you’d want to set it per-iteration (inside the PowerShell loo...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Add minimal diagnostic-only logging for sd-cli.exe crash" to "Add diagnostic logging to chase sd-cli.exe ESRGAN upscale crash on Windows ROCm" on Apr 14, 2026
Copilot AI requested a review from Geramy April 14, 2026 20:51
Member

Geramy commented Apr 14, 2026

@claude do this to this branch

1) Make the failing path print the actual missing DLL name
Right now, your log shows the crash code but not which DLL is missing. The workflow comments suggest ProcMon is supposed to generate a missing-DLL report, but in the job output we have, that report contents are not shown (and the repo search didn’t find the ProcMon scripts, so I can’t point to their implementation).

To get an actionable fix, update the Windows ROCm SD diagnostic to emit the missing dependency name(s) directly in the job log by adding a “dependency check” step right before running the SD tests, e.g.:

  • Run dumpbin /DEPENDENTS sd-cli.exe (if available on the runner), or
  • Use a small PowerShell check that tries to load the executable and captures loader errors, or
  • Ensure the generated missing-DLL report is printed to the console at the end of the run.
Once you see the exact DLL names, you can fix deterministically by installing that runtime or bundling it.

Contributor

github-actions bot commented Apr 14, 2026

Claude finished @Geramy's task in 5m 55s


Adding missing-DLL diagnostic to sd-cli crash path

  • Read task and gather context
  • Read current sd_server.cpp and process_manager.cpp
  • Add log_missing_dll_imports_win32() helper
  • Trigger call from upscale_via_cli on 0xC0000135
  • Push to branch (needs approval — see below)

The implementation is committed locally and ready to push. A push approval is needed (the git push requires user confirmation in this environment).

What was implemented

New function log_missing_dll_imports_win32 (Windows-only, anonymous namespace in sd_server.cpp):

  1. Calls LoadLibraryExA(exe_path, LOAD_LIBRARY_AS_IMAGE_RESOURCE) — maps sd-cli.exe as a PE image with sections at their virtual addresses, no DLL resolution, no code executed. Bit 1 of the HMODULE is cleared to get the true base pointer.
  2. Walks IMAGE_IMPORT_DESCRIPTOR entries in the PE import table to extract every DLL name.
  3. For each DLL name:
    • First: checks if the file exists in the same directory as sd-cli.exe (catches bundled ROCm DLLs like hipblas.dll, amdhip64.dll)
    • Then: tries LoadLibraryExA(dll_name, LOAD_LIBRARY_AS_DATAFILE) against the server's system/PATH search
    • Logs a WARNING with the DLL name and Windows error code for any that fails both checks
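Step 2 above (walking the import descriptors) can be illustrated with a small Python sketch that reads the raw file bytes. This is an assumption-laden illustration, not the PR's code: the C++ helper maps the image with LoadLibraryExA and follows RVAs from the mapped base, whereas this version translates RVAs to file offsets through the section table. The names `_rva_to_offset` and `imported_dll_names` are hypothetical, and delay-load imports are not covered.

```python
import struct


def _rva_to_offset(sections, rva):
    """Translate a relative virtual address into a file offset."""
    for vaddr, vsize, raw_ptr, raw_size in sections:
        if vaddr <= rva < vaddr + max(vsize, raw_size):
            return raw_ptr + (rva - vaddr)
    raise ValueError(f"RVA {rva:#x} is not inside any section")


def imported_dll_names(data: bytes) -> list:
    """List DLL names from the import directory of a PE file's raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS signature")
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    (num_sections,) = struct.unpack_from("<H", data, pe_off + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_off + 20)
    opt_off = pe_off + 24
    (magic,) = struct.unpack_from("<H", data, opt_off)
    # Data directories start 112 bytes into a PE32+ optional header, 96 for PE32.
    dir_off = opt_off + (112 if magic == 0x20B else 96)
    import_rva, _size = struct.unpack_from("<II", data, dir_off + 8)  # entry 1
    if import_rva == 0:
        return []  # no import directory at all

    sections = []
    sec_off = opt_off + opt_size
    for i in range(num_sections):
        vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<IIII", data, sec_off + 40 * i + 8)
        sections.append((vaddr, vsize, raw_ptr, raw_size))

    names = []
    desc_off = _rva_to_offset(sections, import_rva)
    while True:
        fields = struct.unpack_from("<IIIII", data, desc_off)
        if not any(fields):
            break  # an all-zero descriptor terminates the table
        name_off = _rva_to_offset(sections, fields[3])  # Name RVA
        end = data.index(b"\0", name_off)
        names.append(data[name_off:end].decode("ascii"))
        desc_off += 20  # sizeof(IMAGE_IMPORT_DESCRIPTOR)
    return names
```

A walk like this only lists the direct imports of the one file it parses; dependencies of those DLLs would need the same walk applied to each of them in turn.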

Call site in upscale_via_cli, immediately after the existing ESRGAN upscale failed warning, gated on static_cast<uint32_t>(exit_code) == 0xC0000135u.

Expected output when the crash recurs

[SDServer] 0xC0000135 = STATUS_DLL_NOT_FOUND; scanning sd-cli imports for missing DLLs...
[SDServer] DLL check: scanning imports of ...sd-cli.exe
[SDServer] DLL check MISSING: hipblas.dll (not in C:\...\sd-cpp, err=126)

err=126 = ERROR_MOD_NOT_FOUND. Once the exact DLL name is known, the fix is deterministic (bundle it or ensure the runtime dir is on PATH).

Member

Geramy commented Apr 14, 2026

@claude
please push to this branch copilot/add-diagnostic-logging-sd-cli-crash

Member

Geramy commented Apr 14, 2026

@claude please push

Contributor

github-actions bot commented Apr 14, 2026

Claude finished @Geramy's task in 2m 28s


Missing-DLL diagnostic implemented — push needs approval

The log_missing_dll_imports_win32 function has been implemented and committed locally. The push requires your approval in Claude Code.

  • Received push request
  • Identified that log_missing_dll_imports_win32 was never committed (branch was already up-to-date with remote but missing the function)
  • Implemented log_missing_dll_imports_win32 in sd_server.cpp
  • Committed: 628b663 Add log_missing_dll_imports_win32 to identify missing DLLs on 0xC0000135 crash
  • Push to remote (needs approval)

What was committed

New function log_missing_dll_imports_win32 (#ifdef _WIN32 only, anonymous namespace in sd_server.cpp):

  1. Calls LoadLibraryExA(exe_path, LOAD_LIBRARY_AS_IMAGE_RESOURCE | LOAD_LIBRARY_AS_DATAFILE) — maps sd-cli.exe as a PE image at virtual addresses, no DLL resolution, no code executed. Bits 0–1 of the HMODULE are cleared to get the true base pointer.
  2. Validates IMAGE_DOS_HEADER and IMAGE_NT_HEADERS signatures.
  3. Walks IMAGE_IMPORT_DESCRIPTOR entries in the PE import table to extract every DLL name.
  4. For each DLL name:
    • First: checks if the file exists in the same directory as sd-cli.exe (catches bundled ROCm DLLs like hipblas.dll, amdhip64.dll)
    • Then: tries LoadLibraryExA(dll_name, LOAD_LIBRARY_AS_DATAFILE) against the system/PATH search
    • Logs a WARNING with the DLL name and Windows error code for any that fails both checks

Call site in upscale_via_cli, immediately after the existing ESRGAN upscale failed warning, gated on static_cast<uint32_t>(exit_code) == 0xC0000135u.

Expected output when the crash recurs

[SDServer] 0xC0000135 = STATUS_DLL_NOT_FOUND; scanning sd-cli imports for missing DLLs...
[SDServer] DLL check: scanning imports of ...sd-cli.exe
[SDServer] DLL check MISSING: hipblas.dll (not in C:\...\sd-cpp, err=126)

err=126 = ERROR_MOD_NOT_FOUND. Once the exact DLL name is known, the fix is deterministic.

Please approve the git push in Claude Code to complete the push.

Member

Geramy commented Apr 14, 2026

@copilot please take this over and implement this. Just push to this branch, don't ask.
copilot/add-diagnostic-logging-sd-cli-crash

Contributor Author

Copilot AI commented Apr 14, 2026

@copilot please take this over and implement this just push to this branch dont ask.
copilot/add-diagnostic-logging-sd-cli-crash

Done in d255d16. Added log_missing_dll_imports_win32() to sd_server.cpp — a Windows-only helper that:

  1. Maps sd-cli.exe as a PE image (no code execution, no DLL resolution)
  2. Walks every IMAGE_IMPORT_DESCRIPTOR entry in the import table
  3. For each imported DLL, checks: (a) beside sd-cli.exe first (bundled ROCm DLLs), then (b) system PATH/known-DLLs via LoadLibraryExA
  4. Logs DLL check MISSING: <name> (not in <dir>, err=<code>) for anything that fails both checks

Triggered automatically when the upscale subprocess exits with 0xC0000135 (STATUS_DLL_NOT_FOUND). Expected output when the crash recurs:

[SDServer] 0xC0000135 = STATUS_DLL_NOT_FOUND; scanning sd-cli imports for missing DLLs...
[SDServer] DLL check: scanning imports of ...sd-cli.exe
[SDServer] DLL check MISSING: hipblas.dll (not in C:\...\sd-cpp, err=126)

Member

Geramy commented Apr 15, 2026

@copilot What was the output of the test?

Contributor Author

Copilot AI commented Apr 15, 2026

@copilot please take this over and implement this just push to this branch dont ask.
copilot/add-diagnostic-logging-sd-cli-crash

...

The diagnostic logging ran on the CI runner (artifact server-logs-exe-stable-diffusion (stx-halo) from run 24425997128). Here is the exact output:

[Info]  (SDServer) sd-cli upscale launch: cli=.\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe model=...RealESRGAN_x4plus.pth input=C:\windows\SystemTemp\lemonade_upscale\input_3005812486250900.png output=C:\windows\SystemTemp\lemonade_upscale\output_3005812486250900.png input_bytes=122337
[Info]  (SDServer) sd-cli cmdline: ".\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe" "-M" "upscale" "--upscale-model" "...RealESRGAN_x4plus.pth" "-i" "...\input_3005812486250900.png" "-o" "...\output_3005812486250900.png"
[Info]  (SDServer) sd-cli env: PATH=.\ci-cache\bin\sd-cpp\rocm-preview;C:\Users\selfhost\miniforge3;...
[Warn]  (ProcessManager) Process exited with non-zero code: -1073741515 (0xc0000135)
[Warn]  (SDServer) ESRGAN upscale failed exit_code=-1073741515 (0xc0000135) out_exists=false out_size=0 model=...RealESRGAN_x4plus.pth cli=.\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe
[Warn]  (SDServer) 0xC0000135 = STATUS_DLL_NOT_FOUND; scanning sd-cli imports for missing DLLs...
[Info]  (SDServer) DLL check: scanning imports of .\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe

The PE import scanner produced no DLL check MISSING lines, which means every statically-imported DLL was found by the scanner — yet sd-cli.exe still exits 0xC0000135.

The most likely explanation: the PATH passed to the subprocess contains a relative path (.\ci-cache\bin\sd-cpp\rocm-preview). When sd-cli.exe starts, the Windows loader resolves relative PATH entries against the subprocess's working directory. If that working directory differs from the one the server resolves the relative path against, the ROCm DLLs (e.g. amdhip64.dll) are invisible to the loader even though they exist on disk. That would produce STATUS_DLL_NOT_FOUND with no static-import failures visible from the parent process's perspective.

The fix is to make the ROCm runtime directory path absolute when constructing the PATH for the subprocess. I can implement that if you'd like.
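A minimal sketch of that idea, in Python rather than the server's C++ and under stated assumptions: the helper name `absolutize_path_entries` is hypothetical, `base_dir` stands in for whatever directory the server resolves its relative paths against, and the example base directory `C:\work` is made up for illustration. `ntpath` is used so the Windows path rules apply regardless of the host platform.

```python
import ntpath


def absolutize_path_entries(path_value: str, base_dir: str) -> str:
    """Rewrite relative entries of a ';'-separated PATH as absolute paths."""
    entries = []
    for entry in path_value.split(";"):
        if entry and not ntpath.isabs(entry):
            # Anchor the relative entry at base_dir and collapse ".\" segments.
            entry = ntpath.normpath(ntpath.join(base_dir, entry))
        entries.append(entry)
    return ";".join(entries)


# Example: only the first entry is relative, so only it is rewritten.
fixed = absolutize_path_entries(
    ".\\ci-cache\\bin\\sd-cpp\\rocm-preview;C:\\Users\\selfhost\\miniforge3",
    "C:\\work",  # hypothetical server working directory
)
```

With absolute entries, the subprocess sees the same directories no matter which working directory it is launched in.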
