feat: add single_model_mode to force unload before load by jroth1111 · Pull Request #730 · jundot/omlx

jroth1111 · 2026-04-12T03:49:07Z

Context

This fixes a recurring production issue for oMLX on Apple Silicon laptops.

Right now:

- oMLX will happily load as many models as fit inside max_model_memory
- It only evicts when you go over the limit
- max_model_memory was designed as an upper bound, not a target
- On laptop installations with multiple models available, users end up with 2-3 loaded at near-full memory utilisation
- KV cache, context windows, and the OS get squeezed out, and inference falls over with 507 errors on any reasonable context window

Users have been working around this by artificially setting max_model_memory 30-40% lower than actual available RAM. That breaks the semantics of the setting, wastes memory, and is not discoverable.

This PR adds an opt-in single_model_mode that evicts all other non-pinned loaded models before loading a different model, even when max_model_memory would otherwise allow coexistence. The goal is to preserve headroom for KV cache, context windows, and the OS on memory-constrained Apple Silicon machines. Existing behavior remains the default, pinned models remain protected, and the setting can be configured via settings, environment, CLI, or the admin API.

What this changes

When enabled, oMLX evicts all other non-pinned loaded models before loading the requested one, regardless of whether memory would allow coexistence.

The eviction block is placed in get_engine() after the "already loaded" early return and after the "too large" check, but before the existing headroom-based eviction. This means:

- Same-model requests are a no-op (the early return fires first)
- Models are not unloaded for requests that will fail anyway (the size check rejects first)
- The existing headroom and process-memory checks still run afterwards as a safety net (they can still matter when pinned models remain loaded)

Pinned models: Pinned models are never evicted by single_model_mode. If multiple pinned models exist, they will coexist regardless of this setting. The mode only evicts non-pinned models.

Same-model no-op: If the requested model is already loaded, nothing is evicted.
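The ordering described above can be sketched as follows. This is a minimal illustration, not the actual `omlx/engine_pool.py` code: `ToyPool` and its attributes are hypothetical names, model sizes are plain integers in arbitrary units, "loading" is a dictionary insert, and step 4 is a simplified stand-in for the existing headroom-based eviction.

```python
from collections import OrderedDict


class ToyPool:
    """Toy stand-in for an engine pool; every name here is hypothetical."""

    def __init__(self, sizes, max_model_memory, single_model_mode=False, pinned=()):
        self._sizes = dict(sizes)        # model id -> size in arbitrary units
        self._max = max_model_memory
        self.single_model_mode = single_model_mode
        self._pinned = set(pinned)
        self._loaded = OrderedDict()     # model id -> size, least recently used first

    def _evictable(self):
        # Pinned models are never eviction candidates.
        return [m for m in self._loaded if m not in self._pinned]

    def get_engine(self, model_id):
        # 1. Already loaded: early return, nothing is evicted.
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)
            return model_id

        size = self._sizes[model_id]

        # 2. Too large to ever fit: reject before unloading anything.
        if size > self._max:
            raise MemoryError(f"{model_id} exceeds max_model_memory")

        # 3. single_model_mode: evict every other non-pinned model up front.
        if self.single_model_mode:
            for victim in self._evictable():
                del self._loaded[victim]

        # 4. Simplified headroom-based eviction, still run as a safety net.
        while sum(self._loaded.values()) + size > self._max:
            victims = self._evictable()
            if not victims:
                raise MemoryError("not enough headroom; remaining models are pinned")
            del self._loaded[victims[0]]

        self._loaded[model_id] = size
        return model_id

    def loaded(self):
        return list(self._loaded)


pool = ToyPool({"a": 4, "b": 4, "c": 4, "huge": 99}, max_model_memory=12,
               single_model_mode=True, pinned={"a"})
pool.get_engine("a")
pool.get_engine("b")
pool.get_engine("c")    # "b" is evicted although all three would fit in 12 units
print(pool.loaded())    # ['a', 'c']
```

In this sketch the pinned model "a" survives the switch, and a request for "huge" would raise before anything is unloaded, which matches the intent of placing the eviction block after the size check.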

Configuration

// settings.json → model section
{
  "model": {
    "single_model_mode": true
  }
}

# CLI
omlx serve --single-model-mode
omlx serve --no-single-model-mode   # explicitly disable (overrides config)

# Environment variable
OMLX_SINGLE_MODEL_MODE=true

# Admin API (runtime toggle, no restart needed)
POST /admin/api/global-settings
{"single_model_mode": true}
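
One way the configuration sources above could be combined is sketched below. It is an illustration under stated assumptions, not the wiring in `omlx/settings.py` or `omlx/cli.py`: the helper `resolve_single_model_mode`, the accepted truthy spellings, and the precedence of CLI over environment over settings file are all hypothetical. The PR itself only states that `--no-single-model-mode` overrides config. The sketch also omits the `serve` subcommand.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="omlx-sketch")
    # BooleanOptionalAction (Python 3.9+) generates both --single-model-mode
    # and --no-single-model-mode; default=None means "not given on the CLI".
    parser.add_argument(
        "--single-model-mode",
        action=argparse.BooleanOptionalAction,
        default=None,
    )
    return parser


def resolve_single_model_mode(argv, environ, settings):
    """Assumed precedence: CLI flag, then environment, then settings, then off."""
    cli_value = build_parser().parse_args(argv).single_model_mode
    if cli_value is not None:
        return cli_value

    env_value = environ.get("OMLX_SINGLE_MODEL_MODE")
    if env_value is not None:
        # Assumed set of truthy spellings; anything else counts as false.
        return env_value.strip().lower() in {"1", "true", "yes", "on"}

    return bool(settings.get("model", {}).get("single_model_mode", False))
```

Using a tri-state default (`None`) is what lets an explicit `--no-single-model-mode` win over a settings file that enables the mode, while an absent flag leaves the lower-priority sources in charge.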

Files changed

| File | Change |
| --- | --- |
| `omlx/settings.py` | `single_model_mode: bool` on `ModelSettings` dataclass, with `to_dict`/`from_dict`/env/CLI override wiring |
| `omlx/engine_pool.py` | Flag on `EnginePool.__init__`, property with setter (for runtime admin toggle), eviction block in `get_engine()`, included in `get_status()` |
| `omlx/server.py` | Reads the setting and passes it to `EnginePool` |
| `omlx/cli.py` | `--single-model-mode` and `--no-single-model-mode` CLI flags |
| `omlx/admin/routes.py` | Added to admin settings request model, GET status endpoint, and live POST update |
| `tests/test_engine_pool.py` | New tests in `TestSingleModelMode` |

Known limitations

- Pinned models are never evicted, so multiple pinned models may still coexist when this mode is enabled.
- Runtime toggling affects future model loads only. Flipping single_model_mode on via the admin API does not proactively unload currently loaded models. They are evicted on the next model switch.
- SpecPrefill draft models are loaded internally by the serving engine and are not separately managed by the engine pool. They are unloaded when the parent model is evicted and must reload on next use.
- This does not guarantee elimination of all OOM failures. Process memory enforcement, KV cache growth, and pinned model coexistence can still be limiting factors.

Known interaction: in-flight requests

_unload_engine() aborts any active requests on the victim model, which is the same behavior as the existing LRU eviction. With single_model_mode enabled, a request for a different model that arrives while another request is still running will therefore kill the in-flight connection. This is intentional: the user opted into aggressive eviction.
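
The effect on an in-flight request can be illustrated with a small asyncio sketch. Everything in it is hypothetical (`ToyEngine`, `generate`, `unload`, `switch_mid_request`); it only mimics the described behavior of aborting active requests when a model is unloaded, and makes no claim about how `_unload_engine()` is actually implemented.

```python
import asyncio


class ToyEngine:
    """Hypothetical engine that tracks its in-flight request tasks."""

    def __init__(self, name):
        self.name = name
        self._active = set()

    async def generate(self, seconds):
        task = asyncio.current_task()
        self._active.add(task)
        try:
            await asyncio.sleep(seconds)    # stands in for token generation
            return f"{self.name}: done"
        finally:
            self._active.discard(task)

    async def unload(self):
        # Cancel whatever is still running, then wait for the cancellations
        # to settle so no request outlives the engine.
        tasks = list(self._active)
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def switch_mid_request():
    victim = ToyEngine("model-a")
    request = asyncio.create_task(victim.generate(60))
    await asyncio.sleep(0)                  # let the request start
    await victim.unload()                   # a model switch evicts model-a
    return request.cancelled()


print(asyncio.run(switch_mid_request()))    # True
```

From the client's side the cancelled request looks like a dropped connection, so callers that alternate between models under this mode would likely need their own retry logic.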

Test results

61 passed in tests/test_engine_pool.py (0 regressions)

TestSingleModelMode covers: default off, constructor flag, runtime setter, status reporting, eviction on switch, no-eviction when disabled, pinned model protection, pinned models kept even when incoming is pinned, same-model no-op.

🤖 Generated with Claude Code

Add a `single_model_mode` setting that unloads all other loaded models before loading the requested one, even when memory would allow multiple models to coexist. This minimizes peak memory usage during model switches on memory-constrained Apple Silicon machines.

Configurable via:
- settings.json: `"single_model_mode": true` under the `model` section
- CLI: `omlx serve --single-model-mode`
- Admin API: runtime toggle via PATCH /admin/settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a `single_model_mode` setting that unloads all other loaded models before loading the requested one, even when memory would allow multiple models to coexist. This minimizes peak memory usage during model switches on memory-constrained Apple Silicon machines.

Behavior:
- When enabled, all non-pinned models are evicted before loading a new one
- Pinned models are skipped unless the incoming model is also pinned
- Same-model requests (already loaded) skip eviction entirely
- Runtime toggleable via admin API without restart

Configurable via:
- settings.json: `"single_model_mode": true` under `model` section
- CLI: `omlx serve --single-model-mode` / `--no-single-model-mode`
- Env: `OMLX_SINGLE_MODEL_MODE=true`
- Admin API: runtime toggle via PATCH /admin/settings

Includes 10 unit tests covering: default off, constructor flag, runtime setter, status reporting, eviction on switch, no-eviction when disabled, pinned model protection, pinned override, and same-model no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Simplify the pinned-model semantic: pinned models are unconditionally protected, regardless of whether the incoming model is also pinned. This keeps pin semantics consistent: a pinned model stays loaded unless explicitly unloaded by the user via the admin API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gwizz and others added 4 commits April 12, 2026 13:48

Fix single-model mode review issues

1065d36

jundot force-pushed the main branch 5 times, most recently from 6670575 to 6041883 on April 14, 2026, 14:17

jroth1111 wants to merge 4 commits into jundot:main from jroth1111:feat/single-model-mode
