feat: add single_model_mode to force unload before load #730
Open
jroth1111 wants to merge 4 commits into jundot:main from
Conversation
Add a `single_model_mode` setting that unloads all other loaded models before loading the requested one, even when memory would allow multiple models to coexist. This minimizes peak memory usage during model switches on memory-constrained Apple Silicon machines.

Behavior:
- When enabled, all non-pinned models are evicted before loading a new one
- Pinned models are skipped unless the incoming model is also pinned
- Same-model requests (already loaded) skip eviction entirely
- Runtime toggleable via admin API without restart

Configurable via:
- settings.json: `"single_model_mode": true` under the `model` section
- CLI: `omlx serve --single-model-mode` / `--no-single-model-mode`
- Env: `OMLX_SINGLE_MODEL_MODE=true`
- Admin API: runtime toggle via PATCH /admin/settings

Includes 10 unit tests covering: default off, constructor flag, runtime setter, status reporting, eviction on switch, no-eviction when disabled, pinned model protection, pinned override, and same-model no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify the pinned-model semantic: pinned models are unconditionally protected, regardless of whether the incoming model is also pinned. This keeps pin semantics consistent — a pinned model stays loaded unless explicitly unloaded by the user via the admin API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 6670575 to 6041883
Context
This fixes a recurring production issue for oMLX on Apple Silicon laptops.
Right now:
- `max_model_memory` was designed as an upper bound, not a target.
- Users have been working around this by artificially setting `max_model_memory` 30-40% lower than actual available RAM. That breaks the semantics of the setting, wastes memory, and is not discoverable.

This PR adds an opt-in `single_model_mode` that evicts all other non-pinned loaded models before loading a different model, even when `max_model_memory` would otherwise allow coexistence. The goal is to preserve headroom for KV cache, context windows, and the OS on memory-constrained Apple Silicon machines. Existing behavior remains the default, pinned models remain protected, and the setting can be configured via settings, environment, CLI, or the admin API.

What this changes
When enabled, oMLX evicts all other non-pinned loaded models before loading the requested one, regardless of whether memory would allow coexistence.
The eviction block is placed in `get_engine()` after the "already loaded" early return and after the "too large" check, but before the existing headroom-based eviction. This means:

- **Pinned models:** never evicted by `single_model_mode`. If multiple pinned models exist, they will coexist regardless of this setting. The mode only evicts non-pinned models.
- **Same-model no-op:** if the requested model is already loaded, nothing is evicted.
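The ordering above can be sketched with a toy pool. All names here (`ToyPool`, `loaded`, `pinned`) are illustrative assumptions, not the real oMLX `EnginePool` API, and the "too large" and headroom checks are only marked as comments:

```python
class ToyPool:
    """Toy sketch of the get_engine() ordering described above."""

    def __init__(self, single_model_mode=False):
        self.single_model_mode = single_model_mode
        self.loaded = {}     # model_id -> engine (a string stand-in here)
        self.pinned = set()  # model_ids protected from eviction

    def get_engine(self, model_id):
        # 1. Already-loaded early return: same-model requests evict nothing.
        if model_id in self.loaded:
            return self.loaded[model_id]
        # 2. (The real code's "too large" check would run here.)
        # 3. single_model_mode eviction: drop every non-pinned model before
        #    loading, regardless of whether memory would allow coexistence.
        if self.single_model_mode:
            for victim in [m for m in self.loaded if m not in self.pinned]:
                del self.loaded[victim]
        # 4. (The real code's headroom-based LRU eviction would run here.)
        self.loaded[model_id] = f"engine:{model_id}"
        return self.loaded[model_id]


pool = ToyPool(single_model_mode=True)
pool.get_engine("a")
pool.pinned.add("a")
pool.get_engine("b")   # "a" is pinned, so it survives the switch
assert set(pool.loaded) == {"a", "b"}
pool.get_engine("c")   # "b" is not pinned, so it is evicted
assert set(pool.loaded) == {"a", "c"}
```

The usage at the bottom exercises the two bullets above: the pinned model coexists with the newcomer, while the non-pinned one is evicted on the next switch.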
Configuration
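Pulling together the options listed in the commit message, the settings.json form would look like this (the key path comes from the PR description; treat the exact schema as an assumption):

```json
{
  "model": {
    "single_model_mode": true
  }
}
```

Equivalently: `omlx serve --single-model-mode` (or `--no-single-model-mode` to force it off), `OMLX_SINGLE_MODEL_MODE=true` in the environment, or a runtime toggle via `PATCH /admin/settings`.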
Files changed
- `omlx/settings.py`: `single_model_mode: bool` on the `ModelSettings` dataclass, with `to_dict`/`from_dict`/env/CLI override wiring
- `omlx/engine_pool.py`: `EnginePool.__init__`, property with setter (for runtime admin toggle), eviction block in `get_engine()`, included in `get_status()`
- `omlx/server.py`: `EnginePool` wiring
- `omlx/cli.py`: `--single-model-mode` and `--no-single-model-mode` CLI flags
- `omlx/admin/routes.py`
- `tests/test_engine_pool.py`: `TestSingleModelMode`

Known limitations
Turning `single_model_mode` on via the admin API does not proactively unload currently loaded models. They are evicted on the next model switch.

Known interaction: in-flight requests
`_unload_engine()` aborts any active requests on the victim model. This is the same behavior as the existing LRU eviction. With `single_model_mode`, a model switch mid-request will kill the in-flight connection. This is intentional: the user opted into aggressive eviction.
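The abort-on-evict behavior amounts to cancelling whatever tasks are still running against the victim engine. A minimal asyncio sketch of that pattern (`in_flight_request` and the model name are invented here; this is not the real `_unload_engine()`):

```python
import asyncio

async def main():
    aborted = []

    async def in_flight_request(model_id):
        try:
            await asyncio.sleep(60)  # stand-in for a long generation
        except asyncio.CancelledError:
            aborted.append(model_id)  # the in-flight connection dies here
            raise

    # One active request on the model about to be evicted.
    task = asyncio.create_task(in_flight_request("victim-model"))
    await asyncio.sleep(0)  # let the request start running

    # Unloading the victim engine cancels its active requests, the same
    # pattern the existing LRU eviction path uses.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return aborted

aborted_models = asyncio.run(main())
print(aborted_models)  # ['victim-model']
```

The cancellation propagates as `CancelledError` inside the request coroutine, which is where a server would tear down the client connection before re-raising.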
61 passed in `tests/test_engine_pool.py` (0 regressions).

`TestSingleModelMode` covers: default off, constructor flag, runtime setter, status reporting, eviction on switch, no-eviction when disabled, pinned model protection, pinned models kept even when incoming is pinned, same-model no-op.

🤖 Generated with Claude Code