
Add ProcMon DLL diagnostic for Windows ROCm Stable Diffusion CI #1633

Draft
Copilot wants to merge 3 commits into superm1/multiple-rocm-backends from copilot/superm1add-procmon-diagnostic-step

Conversation

Contributor

Copilot AI commented Apr 14, 2026

sd-cli.exe crashes with exit code -1073741515 (0xC0000135, STATUS_DLL_NOT_FOUND: a required DLL or runtime could not be loaded) during the ESRGAN upscale test on the stx-halo Windows ROCm runner. This adds a non-intrusive ProcMon capture around the test run to surface exactly which libraries are absent.
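
The exit code is easier to recognize in hexadecimal. A minimal sketch, assuming the code is reported as a signed 32-bit integer as it is in these logs; the helper name `to_ntstatus_hex` is hypothetical and not part of this PR:

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# 0xC0000135 is STATUS_DLL_NOT_FOUND: the loader could not resolve a required DLL.
print(to_ntstatus_hex(-1073741515))  # 0xC0000135
```

Masking with `0xFFFFFFFF` gives the two's-complement view of the value, which is the form Windows documentation uses for NTSTATUS codes.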

New scripts (scripts/ci/windows/)

  • procmon-start.ps1 — downloads Sysinternals ProcMon, accepts EULA via registry, starts background capture to a PML backing file, exports PROCMON_EXE / PROCMON_BACKING_FILE to GITHUB_ENV
  • procmon-stop-and-report.ps1 — terminates ProcMon (flushing PML), converts PML→CSV via /OpenLog /SaveAs, filters for NAME NOT FOUND on .dll/.exe/.sys paths, writes missing-dlls-report.txt (grouped summary + full event list), and prints the report inline in the workflow log
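
The filtering that procmon-stop-and-report.ps1 performs can be illustrated roughly as follows. This is a Python sketch of the idea rather than the PowerShell script itself; it assumes the exported CSV uses ProcMon's default column headers ("Process Name", "Path", "Result"), and the function name and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

BINARY_SUFFIXES = (".dll", ".exe", ".sys")


def missing_binaries(csv_text: str) -> Counter:
    """Count NAME NOT FOUND events per (process, path) for binary-looking paths."""
    counts = Counter()
    # Strip a leading byte-order mark, which an exported CSV may carry.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")))
    for row in reader:
        path = row.get("Path") or ""
        if row.get("Result") == "NAME NOT FOUND" and path.lower().endswith(BINARY_SUFFIXES):
            counts[(row.get("Process Name") or "", path)] += 1
    return counts


# Made-up rows for illustration only; not output from the actual CI run.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0","sd-cli.exe","4242","CreateFile","C:\bin\example_a.dll","NAME NOT FOUND",""
"10:00:00.1","sd-cli.exe","4242","CreateFile","C:\bin\example_a.dll","NAME NOT FOUND",""
"10:00:00.2","sd-cli.exe","4242","CreateFile","C:\bin\example_b.dll","SUCCESS",""
"10:00:00.3","sd-cli.exe","4242","CreateFile","C:\bin\notes.txt","NAME NOT FOUND",""
'''

for (process, path), hits in missing_binaries(SAMPLE).most_common():
    print(hits, process, path)
```

Grouping by (process, path) keeps the summary short even when the same lookup fails many times, which is the purpose of the grouped section of the report.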

Workflow changes

Three steps added to test-exe-inference, scoped exclusively to matrix.name == 'stable-diffusion (stx-halo)':

- name: Start ProcMon capture (ROCm SD diagnostic)
  if: ${{ matrix.name == 'stable-diffusion (stx-halo)' }}
  shell: PowerShell
  run: scripts\ci\windows\procmon-start.ps1

# ... existing Run tests step unchanged ...

- name: Stop ProcMon and generate DLL report (ROCm SD diagnostic)
  if: ${{ always() && matrix.name == 'stable-diffusion (stx-halo)' }}
  shell: PowerShell
  run: scripts\ci\windows\procmon-stop-and-report.ps1

- name: Upload ProcMon diagnostic artifacts (ROCm SD diagnostic)
  if: ${{ always() && matrix.name == 'stable-diffusion (stx-halo)' }}
  uses: actions/upload-artifact@v7
  with:
    name: procmon-missing-dlls-${{ matrix.name }}
    path: |
      ${{ env.PROCMON_REPORT_DIR }}
      ${{ env.PROCMON_BACKING_FILE }}

Artifact procmon-missing-dlls-stable-diffusion (stx-halo) is uploaded even on failure and contains:

  • procmon-capture.csv — full event log for deep inspection
  • missing-dlls-report.txt — pre-filtered NAME NOT FOUND entries with process name, PID, and path; primary candidates for a missing redistributable
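
When reading the report, note that not every NAME NOT FOUND entry points to a missing redistributable: Windows probes several directories while resolving a DLL, so a library that is eventually found still leaves failed lookups behind. One way to narrow the list, sketched here in Python under the assumption that events are available as (path, result) pairs, is to keep only file names that never resolved anywhere in the capture. The function name and sample paths are hypothetical.

```python
from ntpath import basename  # Windows path rules on any host OS


def never_resolved(events):
    """Return file names that had failed lookups and no successful one.

    `events` is an iterable of (path, result) pairs taken from the capture.
    """
    failed, found = set(), set()
    for path, result in events:
        name = basename(path).lower()
        if result == "NAME NOT FOUND":
            failed.add(name)
        elif result == "SUCCESS":
            found.add(name)
    return sorted(failed - found)


# Made-up events for illustration only.
events = [
    (r"C:\app\example_a.dll", "NAME NOT FOUND"),
    (r"C:\Windows\System32\example_a.dll", "SUCCESS"),
    (r"C:\app\example_b.dll", "NAME NOT FOUND"),
    (r"C:\Windows\System32\example_b.dll", "NAME NOT FOUND"),
]
print(never_resolved(events))  # ['example_b.dll']
```

Here example_a.dll is dropped because a later probe succeeded, while example_b.dll remains as a candidate for the crash.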
Original prompt

Create a PR against lemonade-sdk/lemonade based on branch superm1/multiple-rocm-backends to add a ProcMon-based diagnostic step to the Windows ROCm Stable Diffusion test workflow.

Context:

Requirements:

  1. In .github/workflows/cpp_server_build_test_release.yml, add steps prior to the inference tests (specifically before running test/server_sd.py on the Windows ROCm matrix entry / stx-halo runner) to:

    • Download and install Sysinternals ProcMon (Procmon.exe) in the runner workspace or temp.
    • Run ProcMon capture around the server process and/or around the failing sd-cli.exe invocation to identify missing libraries/DLLs.
    • Convert the ProcMon capture to a readable format (CSV) and filter for library load failures such as NAME NOT FOUND, focusing on DLL load operations.
    • Upload the resulting report (and optionally the PML) as a workflow artifact (e.g., procmon-missing-dlls-${{ matrix.name }}) even on failure.
  2. The ProcMon report must include:

    • A list of the libraries each process attempted to load.
    • Which ones were not installed/available (e.g., marked NAME NOT FOUND).
    • The process name and path that attempted the load (at minimum sd-cli.exe; include lemonade-server.exe if applicable).
  3. Keep the diagnostic steps scoped so they run only where needed:

    • Only on Windows.
    • Ideally only for the Stable Diffusion ROCm matrix case to reduce runtime/noise.
  4. Ensure ProcMon runs non-interactively (accept EULA), starts capture, stops capture, and terminates cleanly.

  5. Add any necessary PowerShell scripts to scripts/ci/windows/ (or similar) to keep YAML readable. Ensure the workflow calls them.

  6. Update documentation/comments in workflow to explain how to interpret the artifact.

Deliverables:

  • A PR with changes to the workflow and any new scripts.
  • Artifacts uploaded in CI: ProcMon CSV report and filtered missing-DLL list.

Notes:

  • The repo already has Windows PowerShell steps; implement in PowerShell.
  • ProcMon CLI supports /AcceptEula, /Quiet, /BackingFile, /Minimized, and Procmon.exe /OpenLog <pml> /SaveAs <csv> for conversion.
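
As a rough illustration of how those switches combine into a start, stop, and convert sequence, the sketch below only builds the argument lists and does not launch anything. It is written in Python for brevity although the PR implements the steps in PowerShell. The `/Terminate` switch is a documented ProcMon option but is not named in the notes above, so its use here is an assumption about how the capture is stopped; the function name and paths are hypothetical.

```python
def procmon_commands(procmon_exe: str, pml_path: str, csv_path: str) -> dict:
    """Build ProcMon argument lists for a non-interactive capture lifecycle."""
    return {
        # Start capturing to a backing file without showing the UI or EULA prompt.
        "start": [procmon_exe, "/AcceptEula", "/Quiet", "/Minimized",
                  "/BackingFile", pml_path],
        # Ask the running instance to stop and exit (assumed stop mechanism).
        "stop": [procmon_exe, "/AcceptEula", "/Terminate"],
        # Re-open the saved log and export it; the .csv extension selects CSV.
        "convert": [procmon_exe, "/AcceptEula", "/OpenLog", pml_path,
                    "/SaveAs", csv_path],
    }


cmds = procmon_commands(r"C:\tools\Procmon.exe", r"C:\temp\capture.pml",
                        r"C:\temp\capture.csv")
for step, argv in cmds.items():
    print(step, " ".join(argv))
```

The start command would need to be launched without waiting for it to exit, since the capture keeps running until it is stopped.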

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing in the Stable Diffusion (ROCm) inference test at:

  • Failing test: test/server_sd.py::StableDiffusionTests::test_020_upscale_basic (around line 650)
  • Symptom: /api/v1/images/upscale returns 500 with {"error":{"message":"ESRGAN upscale failed","type":"server_error"}}
  • Server-side detail from logs: the server launches sd-cli.exe -M upscale ... and it exits with code -1073741515 (0xC0000135, STATUS_DLL_NOT_FOUND). That exit code means the Windows loader could not resolve a DLL the binary depends on, which usually points to a missing runtime or a binary built against libraries the runner does not have.

Root cause (most likely)

On the stx-halo Windows runner with ROCm backend, the ESRGAN upscale path invokes a native CLI (.\ci-cache\bin\sd-cpp\rocm-preview\sd-cli.exe) which crashes immediately. This is not a Python test issue—the test correctly expects HTTP 200, but the server can’t complete the upscale because the underlying process dies.

Practical solution: don’t run ESRGAN upscale on the ROCm backend in CI

Given the crash is in the native ROCm preview binary, the fastest, most stable CI fix is to skip the ESRGAN “basic upscale” test when running with --backend rocm (or treat it as xfail) until the ROCm sd-cli upscale mode is fixed upstream.

Code change (recommended): conditionally skip in test/server_sd.py

Add a backend check and skip test_020_upscale_basic for ROCm. For example:

import os
import unittest

# ...

class StableDiffusionTests(ServerTestBase):

    def test_020_upscale_basic(self):
        # Skip upscale on ROCm until sd-cli upscale is stable on Windows ROCm runners
        backend = getattr(self, "backend", None) or os.environ.get("LEMONADE_TEST_BACKEND", "")
        if str(backend).lower() == "rocm":
            raise unittest.SkipTest("Skipping ESRGAN upscale on ROCm backend (sd-cli crash in CI)")

        # existing test continues unchanged...

If ServerTestBase already stores the backend differently, use that; the key point is to detect the backend used by run_server_tests(..., --backend rocm) and skip only then.
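
To check that such a skip behaves as intended, the following self-contained sketch runs a stand-in test case twice. `UpscaleSkipExample` replaces the real StableDiffusionTests class, LEMONADE_TEST_BACKEND is the hypothetical variable name from the suggestion above, and "cpu" is just an arbitrary non-ROCm value.

```python
import io
import os
import unittest


class UpscaleSkipExample(unittest.TestCase):
    """Stand-in for StableDiffusionTests; not the real test class."""

    def test_020_upscale_basic(self):
        backend = os.environ.get("LEMONADE_TEST_BACKEND", "")
        if backend.lower() == "rocm":
            self.skipTest("ESRGAN upscale skipped on ROCm backend")
        self.assertTrue(True)  # placeholder for the real upscale assertions


def run_with_backend(backend: str) -> unittest.TestResult:
    """Run the stand-in test with the given backend value set in the environment."""
    os.environ["LEMONADE_TEST_BACKEND"] = backend
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UpscaleSkipExample)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


print(len(run_with_backend("rocm").skipped))  # 1
print(len(run_with_backend("cpu").skipped))   # 0
```

`self.skipTest(...)` is equivalent to raising `unittest.SkipTest` inside the test body; either form marks the test as skipped rather than failed.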

Complementary workflow-side improvement (optional)

In .github/workflows/cpp_server_build_test_release.yml, the failing matrix entry is:

  • `stable-diffusion (stx-...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Add ProcMon-based diagnostic step to Windows ROCm test workflow" to "Add ProcMon DLL diagnostic for Windows ROCm Stable Diffusion CI" on Apr 14, 2026
Copilot AI requested a review from Geramy April 14, 2026 17:07
Increased timeout for CSV file readiness check and added a function to test file readability.
