Forward GGML_METAL_NO_RESIDENCY env var to llama-server on macOS#1526

Open
Geramy wants to merge 4 commits into main from macos_llamacp_apr3_26_fix
Conversation


@Geramy Geramy commented Apr 3, 2026

Summary

  • Explicitly forwards GGML_METAL_NO_RESIDENCY from lemond's environment to the llama-server subprocess on macOS
  • When lemond is started by launchd (e.g. after .pkg install), it does not inherit the shell environment, so the env var never reaches llama-server
  • This fixes llama-server b8648 crashing on macOS CI runners (MTLGPUFamilyApple5 paravirtualized GPU) due to unsupported Metal residency sets

Test plan

  • Verify macOS CI tests pass with GGML_METAL_NO_RESIDENCY=1 set in workflow
  • Verify no regression on macOS when the env var is not set (normal user machines)

Geramy added 4 commits April 3, 2026 11:42
When lemond is started by launchd (e.g. after .pkg install), it does
not inherit the shell environment. This explicitly forwards the
GGML_METAL_NO_RESIDENCY env var to the llama-server subprocess so
Metal residency sets can be disabled on paravirtualized GPUs like
GitHub Actions macOS runners (MTLGPUFamilyApple5).

Always set GGML_METAL_NO_RESIDENCY=1 when launching llama-server
unless the user has explicitly set the variable themselves. Residency
sets crash on paravirtualized GPUs (e.g. GitHub Actions macOS runners
with MTLGPUFamilyApple5).
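The behavior described in the commit messages above can be sketched as follows. This is a hypothetical Python sketch, not lemond's actual code; the helper names `build_llama_env` and `launch_llama_server` are illustrative:

```python
import os
import subprocess

def build_llama_env(base_env):
    """Build the environment for the llama-server subprocess.

    Hypothetical sketch: default GGML_METAL_NO_RESIDENCY to "1" so Metal
    residency sets are disabled on paravirtualized GPUs (e.g. GitHub
    Actions macOS runners with MTLGPUFamilyApple5), but respect an
    explicit user setting if one is present.
    """
    env = dict(base_env)
    # setdefault only writes the key if the user has not set it already.
    env.setdefault("GGML_METAL_NO_RESIDENCY", "1")
    return env

def launch_llama_server(args):
    # Pass the environment explicitly: a launchd-started parent does not
    # inherit the user's shell environment, so relying on implicit
    # inheritance would drop the variable before it reaches llama-server.
    return subprocess.Popen(["llama-server", *args],
                            env=build_llama_env(os.environ))
```

Explicitly constructing and passing `env` (rather than relying on implicit inheritance) is what makes the variable survive the launchd-started case.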

Geramy commented Apr 4, 2026

There may not be a fix for this until we get an M-series processor as a runner. See below.

- ggml-org/llama.cpp#16266 is the closest issue: Metal crashes on limited/older hardware after the "make backend async" commit. It was closed with "upgrade your macOS" as the resolution.
  - PR #18738 attempted a page-alignment fix but was closed without being merged. The maintainer (ggerganov) wants to find the root cause rather than add a workaround, but can't reproduce the issue on modern hardware.
  - No one has reported this specific issue with GitHub Actions paravirtualized GPUs (MTLGPUFamilyApple5). These aren't real Apple Silicon GPUs; they're VM-emulated GPUs with very limited capabilities.

The llama.cpp team isn't actively working on this because it only affects old macOS versions and limited virtualized GPUs that real users wouldn't run inference on. The CI runners just happen to have these fake GPUs.
