fix: make /load declarative/idempotent to prevent TOCTOU race #1604

Open
ianbmacdonald wants to merge 2 commits into lemonade-sdk:main from ianbmacdonald:fix/load-idempotent

@ianbmacdonald
Collaborator

Summary

  • Fixes a TOCTOU race in handle_load where is_model_loaded(), unload_model(), and load_model() each acquired/released load_mutex_ independently, allowing concurrent requests to cause unnecessary eviction and reload of large models (~90s wasted for 68GB models)
  • Moves the "already loaded?" decision into Router::load_model() under the existing load_mutex_, with an allow_reload_on_option_change parameter for explicit /load callers
  • Redefines /load as declarative ("ensure loaded with these options") — same-options is a no-op, different-options atomically evicts and reloads
  • Adds three integration tests: concurrent race regression, sequential idempotency, and option-change reload
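
The locking change described in the bullets above can be modeled in a few lines. This is only a sketch in Python, not the project's C++ code: the `Router` class here is a toy stand-in, the return strings and the `backend_starts` counter are hypothetical, and a `threading.Lock` plays the role of `load_mutex_`. The point it illustrates is that the "already loaded?" check and the evict/load action happen under a single lock acquisition.

```python
import threading


class Router:
    """Toy model of the Router; names other than load_model and
    allow_reload_on_option_change are illustrative assumptions."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for load_mutex_
        self._loaded = {}              # model name -> options dict
        self.backend_starts = 0        # counts simulated backend launches

    def load_model(self, name, options, allow_reload_on_option_change=False):
        # The check and the action share one critical section, so no other
        # caller can change the loaded set between them.
        with self._lock:
            current = self._loaded.get(name)
            if current is not None:
                if current == options or not allow_reload_on_option_change:
                    return "noop"
                del self._loaded[name]  # evict while still holding the lock
                self._loaded[name] = dict(options)
                self.backend_starts += 1
                return "reloaded"
            self._loaded[name] = dict(options)
            self.backend_starts += 1
            return "loaded"


router = Router()
# Auto-load path: conservative, never reloads an already loaded model.
print(router.load_model("m", {"ctx_size": 4096}))
# Explicit /load with the same options: nothing to do.
print(router.load_model("m", {"ctx_size": 4096}, allow_reload_on_option_change=True))
# Explicit /load with different options: evict and reload in one step.
print(router.load_model("m", {"ctx_size": 8192}, allow_reload_on_option_change=True))
```

In this model the three calls print `loaded`, `noop`, and `reloaded`. With the check split from the action, as in the pre-fix `handle_load`, a second caller could observe "loaded" and evict a model that a concurrent caller had only just finished loading.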

Closes #1603

Test plan

  • Build passes (cmake --build --preset default)
  • Concurrent auto-load + /load for the same model: second arrival no-ops, no eviction in logs
  • Sequential /load with same options: backend_url unchanged (same subprocess, no restart)
  • Sequential /load with different ctx_size: evicts and reloads with new options
  • Full server_endpoints.py suite: 35/35 tests pass on Debian 13 x86_64

🤖 Generated with Claude Code

ianbmacdonald and others added 2 commits April 10, 2026 12:32
The /load endpoint previously did a non-atomic check-unload-reload
sequence outside the Router's load_mutex_, causing unnecessary eviction
and reload of large models when concurrent requests raced (e.g., an
inference-triggered auto-load and an explicit /load for the same model).

Move the "already loaded?" decision into Router::load_model() under the
existing load_mutex_. Add allow_reload_on_option_change parameter so
/load callers can opt into reload-if-options-differ behavior while
auto-load callers remain conservative.

This redefines /load as declarative: "ensure model is loaded with these
options" rather than "always restart." Same-options /load is now a
no-op; different-options /load atomically evicts and reloads.

Tested on Debian 13 (ai4, x86_64) with the patched .deb package:
- Concurrent auto-load + /load: second arrival no-ops (no eviction)
- Sequential idempotent /load: backend_url unchanged (same subprocess)
- Sequential option-change /load: evicts and reloads with new options
- Full server_endpoints.py suite: 35/35 tests pass

Closes lemonade-sdk#1603

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test_012a: replace backend_url comparison with wall-clock time
  (backend_url is not a stable identity — choose_port can pick the
  same port after a restart)
- test_012c: replace non-deterministic concurrent test with a
  sequential scenario that deterministically reproduces the lemonade-sdk#1603
  race: load via inference first, then /load. Wall-clock time proves
  whether a reload occurred (0.002s no-op vs seconds for evict+reload)
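
The wall-clock approach described above can be sketched as follows. This is an illustrative Python example, not the code in server_endpoints.py: `timed` and `FakeServer` are hypothetical helpers, and the 0.2 s sleep is an assumed stand-in for backend startup cost. It shows why elapsed time can distinguish a no-op from a reload even when an identifier such as `backend_url` is reused.

```python
import time


def timed(call):
    """Run call() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


class FakeServer:
    """Hypothetical stand-in for the server: a real load is slow,
    a repeat load with identical options returns immediately."""

    def __init__(self, load_seconds=0.2):
        self.load_seconds = load_seconds
        self.loaded = {}  # model name -> options dict

    def load(self, name, options):
        if self.loaded.get(name) == options:
            return "noop"
        time.sleep(self.load_seconds)  # simulates backend startup cost
        self.loaded[name] = dict(options)
        return "loaded"


server = FakeServer()
_, first = timed(lambda: server.load("m", {"ctx_size": 4096}))
_, second = timed(lambda: server.load("m", {"ctx_size": 4096}))
print(f"first load: {first:.3f}s, repeat load: {second:.3f}s")
```

A test built this way needs a threshold that sits well between the two regimes; the right value depends on the model and hardware, so any fixed cutoff is an assumption to be tuned.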

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ianbmacdonald marked this pull request as ready for review April 10, 2026 16:43

Development

Successfully merging this pull request may close these issues.

fix: make /load idempotent to prevent TOCTOU race causing unnecessary model reloads

1 participant