fix: make /load declarative/idempotent to prevent TOCTOU race #1604
Open
ianbmacdonald wants to merge 2 commits into lemonade-sdk:main from
Conversation
The /load endpoint previously did a non-atomic check-unload-reload sequence outside the Router's load_mutex_, causing unnecessary eviction and reload of large models when concurrent requests raced (e.g., an inference-triggered auto-load and an explicit /load for the same model).

Move the "already loaded?" decision into Router::load_model() under the existing load_mutex_. Add an allow_reload_on_option_change parameter so /load callers can opt into reload-if-options-differ behavior while auto-load callers remain conservative.

This redefines /load as declarative: "ensure model is loaded with these options" rather than "always restart." A same-options /load is now a no-op; a different-options /load atomically evicts and reloads.

Tested on Debian 13 (ai4, x86_64) with the patched .deb package:

- Concurrent auto-load + /load: second arrival no-ops (no eviction)
- Sequential idempotent /load: backend_url unchanged (same subprocess)
- Sequential option-change /load: evicts and reloads with new options
- Full server_endpoints.py suite: 35/35 tests pass

Closes lemonade-sdk#1603

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test_012a: replace backend_url comparison with wall-clock time (backend_url is not a stable identity; choose_port can pick the same port after a restart)
- test_012c: replace the non-deterministic concurrent test with a sequential scenario that deterministically reproduces the lemonade-sdk#1603 race: load via inference first, then /load. Wall-clock time proves whether a reload occurred (0.002s no-op vs. seconds for evict+reload)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

- In handle_load, is_model_loaded(), unload_model(), and load_model() each acquired/released load_mutex_ independently, allowing concurrent requests to cause unnecessary eviction and reload of large models (~90s wasted for 68GB models)
- Move the "already loaded?" decision into Router::load_model() under the existing load_mutex_, with an allow_reload_on_option_change parameter for explicit /load callers
- Redefine /load as declarative ("ensure loaded with these options"): same-options is a no-op, different-options atomically evicts and reloads
- Closes #1603
Test plan

- Build: cmake --build --preset default
- Concurrent auto-load + /load for the same model: second arrival no-ops, no eviction in logs
- Sequential /load with same options: backend_url unchanged (same subprocess, no restart)
- Sequential /load with different ctx_size: evicts and reloads with new options
- Full server_endpoints.py suite: 35/35 tests pass on Debian 13 x86_64

🤖 Generated with Claude Code