feat: add single_model_mode to force unload before load#730

Open
jroth1111 wants to merge 4 commits into jundot:main from jroth1111:feat/single-model-mode

jroth1111 commented Apr 12, 2026

Context

This fixes a recurring production issue for oMLX on Apple Silicon laptops.

Right now:

  • oMLX will happily load as many models as fit inside max_model_memory
  • It only evicts when you go over the limit
  • max_model_memory was designed as an upper bound, not a target
  • On laptop installations with multiple models available, users end up with 2-3 loaded at near-full memory utilisation
  • KV cache, context windows, and the OS get squeezed out, and inference falls over with 507 errors on any reasonable context window

Users have been working around this by artificially setting max_model_memory 30-40% lower than actual available RAM. That breaks the semantics of the setting, wastes memory, and is not discoverable.

This PR adds an opt-in single_model_mode that evicts all other non-pinned loaded models before loading a different model, even when max_model_memory would otherwise allow coexistence. The goal is to preserve headroom for KV cache, context windows, and the OS on memory-constrained Apple Silicon machines. Existing behavior remains the default, pinned models remain protected, and the setting can be configured via settings, environment, CLI, or the admin API.

What this changes

When enabled, oMLX evicts all other non-pinned loaded models before loading the requested one, regardless of whether memory would allow coexistence.

The eviction block is placed in get_engine() after the "already loaded" early return and after the "too large" check, but before the existing headroom-based eviction. This means:

  • Same-model requests are a no-op (the early return fires first)
  • Models are not unloaded for requests that will fail anyway (the size check rejects first)
  • The existing headroom and process-memory checks still run after as a safety net (they can still matter when pinned models remain loaded)
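
The ordering described above can be illustrated with a toy pool. This is only a sketch under stated assumptions: `ToyPool` and `Model` are hypothetical stand-ins, sizes are plain numbers, and the real `get_engine()` in omlx/engine_pool.py will differ in its details.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    size_gb: float
    pinned: bool = False


@dataclass
class ToyPool:
    """Hypothetical stand-in for EnginePool; not the real oMLX class."""

    max_model_memory_gb: float
    single_model_mode: bool = False
    loaded: dict = field(default_factory=dict)  # name -> Model, oldest first

    def _used(self) -> float:
        return sum(m.size_gb for m in self.loaded.values())

    def get_engine(self, model: Model) -> Model:
        # 1. Already loaded: early return, nothing is evicted.
        if model.name in self.loaded:
            self.loaded[model.name] = self.loaded.pop(model.name)  # mark most recent
            return model
        # 2. Too large to ever fit: reject before touching loaded models.
        if model.size_gb > self.max_model_memory_gb:
            raise MemoryError(f"{model.name} exceeds max_model_memory")
        # 3. single_model_mode: evict every other non-pinned model.
        if self.single_model_mode:
            for name in [n for n, m in self.loaded.items() if not m.pinned]:
                del self.loaded[name]
        # 4. Headroom-based eviction still runs afterwards as a safety net,
        #    which matters when pinned models remain loaded.
        while self._used() + model.size_gb > self.max_model_memory_gb:
            victim = next((n for n, m in self.loaded.items() if not m.pinned), None)
            if victim is None:
                raise MemoryError("pinned models leave no room for the request")
            del self.loaded[victim]
        self.loaded[model.name] = model
        return model


pool = ToyPool(max_model_memory_gb=32, single_model_mode=True)
for m in (Model("p", 4, pinned=True), Model("a", 8), Model("b", 8)):
    pool.get_engine(m)
print(sorted(pool.loaded))  # ['b', 'p']: "a" was evicted although 20 GB would fit
```

In this toy version, a request for a model larger than the limit raises at step 2, so the loaded set is left untouched.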

Pinned models: Pinned models are never evicted by single_model_mode. If multiple pinned models exist, they will coexist regardless of this setting. The mode only evicts non-pinned models.

Same-model no-op: If the requested model is already loaded, nothing is evicted.

Configuration

// settings.json → model section
{
  "model": {
    "single_model_mode": true
  }
}

# CLI
omlx serve --single-model-mode
omlx serve --no-single-model-mode   # explicitly disable (overrides config)

# Environment variable
OMLX_SINGLE_MODEL_MODE=true

# Admin API (runtime toggle, no restart needed)
POST /admin/api/global-settings
{"single_model_mode": true}
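
As a rough illustration of how the three startup sources could be combined, the sketch below resolves the flag with `argparse.BooleanOptionalAction`, which generates the `--single-model-mode` / `--no-single-model-mode` pair. The function name `resolve_single_model_mode`, the accepted truthy strings, and the ordering of environment over settings file are assumptions for illustration; the PR only states that the CLI flag overrides config.

```python
import argparse
import os


def parse_bool(value: str) -> bool:
    # Assumed set of truthy spellings; the real parser may accept others.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_single_model_mode(argv, file_settings, env=os.environ) -> bool:
    """Hypothetical resolver: CLI flag, then environment, then settings file."""
    parser = argparse.ArgumentParser(prog="omlx serve")
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument(
        "--single-model-mode",
        action=argparse.BooleanOptionalAction,
        default=None,
    )
    args = parser.parse_args(argv)
    if args.single_model_mode is not None:
        return args.single_model_mode
    if "OMLX_SINGLE_MODEL_MODE" in env:
        return parse_bool(env["OMLX_SINGLE_MODEL_MODE"])
    return bool(file_settings.get("model", {}).get("single_model_mode", False))


settings = {"model": {"single_model_mode": True}}
print(resolve_single_model_mode(["--no-single-model-mode"], settings, env={}))  # False
```

Using a tri-state default (`None`) is what allows `--no-single-model-mode` to explicitly disable a value that the settings file enables.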

Files changed

| File | Change |
| --- | --- |
| omlx/settings.py | single_model_mode: bool on ModelSettings dataclass, with to_dict/from_dict/env/CLI override wiring |
| omlx/engine_pool.py | Flag on EnginePool.__init__, property with setter (for runtime admin toggle), eviction block in get_engine(), included in get_status() |
| omlx/server.py | Reads the setting and passes it to EnginePool |
| omlx/cli.py | --single-model-mode and --no-single-model-mode CLI flags |
| omlx/admin/routes.py | Added to admin settings request model, GET status endpoint, and live POST update |
| tests/test_engine_pool.py | New tests in TestSingleModelMode |
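
For readers unfamiliar with the settings wiring, the following is a minimal sketch of what a boolean field with `to_dict`/`from_dict` round-tripping might look like. `ModelSettingsSketch` is a hypothetical, single-field class; the real `ModelSettings` dataclass has other fields and its own serialization logic.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class ModelSettingsSketch:
    """Hypothetical, reduced stand-in for the ModelSettings dataclass."""

    single_model_mode: bool = False  # off by default, so existing behavior is unchanged

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ModelSettingsSketch":
        # Ignore unknown keys and fall back to defaults for missing ones,
        # so settings files written before this field existed still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


old_file = {}  # a settings file without the new key
print(ModelSettingsSketch.from_dict(old_file).single_model_mode)  # False
```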

Known limitations

  • Pinned models are never evicted, so multiple pinned models may still coexist when this mode is enabled.
  • Runtime toggling affects future model loads only. Flipping single_model_mode on via the admin API does not proactively unload currently loaded models. They are evicted on the next model switch.
  • SpecPrefill draft models are loaded internally by the serving engine and are not separately managed by the engine pool. They are unloaded when the parent model is evicted and must reload on next use.
  • This does not guarantee elimination of all OOM failures. Process memory enforcement, KV cache growth, and pinned model coexistence can still be limiting factors.
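
The runtime-toggle limitation can be pictured with a small sketch: the setter only flips a flag and leaves the loaded set alone, so nothing is unloaded until the next model switch. `PoolFlagSketch` and its `loaded` list are hypothetical and stand in for the real `EnginePool` property and state.

```python
class PoolFlagSketch:
    """Hypothetical illustration of a flag-only runtime toggle."""

    def __init__(self, single_model_mode: bool = False):
        self._single_model_mode = single_model_mode
        self.loaded = ["model-a", "model-b"]  # pretend both are already loaded

    @property
    def single_model_mode(self) -> bool:
        return self._single_model_mode

    @single_model_mode.setter
    def single_model_mode(self, value: bool) -> None:
        # Only the flag changes here; no model is unloaded proactively.
        self._single_model_mode = bool(value)

    def get_status(self) -> dict:
        return {
            "single_model_mode": self._single_model_mode,
            "loaded": list(self.loaded),
        }


pool = PoolFlagSketch()
pool.single_model_mode = True  # e.g. triggered by the admin API
print(pool.get_status())
```

After the toggle, the status in this sketch reports the flag as enabled while both models are still listed as loaded.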

Known interaction: in-flight requests

_unload_engine() aborts any active requests on the victim model, which is the same behavior as the existing LRU eviction. With single_model_mode enabled, a request for a different model that arrives while another model is still serving will therefore terminate that model's in-flight requests. This is intentional: the user opted into aggressive eviction.

Test results

61 passed in tests/test_engine_pool.py (0 regressions)

TestSingleModelMode covers: default off, constructor flag, runtime setter, status reporting, eviction on switch, no-eviction when disabled, pinned model protection, pinned models kept even when incoming is pinned, same-model no-op.

🤖 Generated with Claude Code

gwizz and others added 4 commits April 12, 2026 13:48
Add a `single_model_mode` setting that unloads all other loaded models
before loading the requested one, even when memory would allow multiple
models to coexist. This minimizes peak memory usage during model switches
on memory-constrained Apple Silicon machines.

Configurable via:
- settings.json: `"single_model_mode": true` under the `model` section
- CLI: `omlx serve --single-model-mode`
- Admin API: runtime toggle via PATCH /admin/settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a `single_model_mode` setting that unloads all other loaded models
before loading the requested one, even when memory would allow multiple
models to coexist. This minimizes peak memory usage during model switches
on memory-constrained Apple Silicon machines.

Behavior:
- When enabled, all non-pinned models are evicted before loading a new one
- Pinned models are skipped unless the incoming model is also pinned
- Same-model requests (already loaded) skip eviction entirely
- Runtime toggleable via admin API without restart

Configurable via:
- settings.json: `"single_model_mode": true` under `model` section
- CLI: `omlx serve --single-model-mode` / `--no-single-model-mode`
- Env: `OMLX_SINGLE_MODEL_MODE=true`
- Admin API: runtime toggle via PATCH /admin/settings

Includes 10 unit tests covering: default off, constructor flag, runtime
setter, status reporting, eviction on switch, no-eviction when disabled,
pinned model protection, pinned override, and same-model no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify the pinned-model semantic: pinned models are unconditionally
protected, regardless of whether the incoming model is also pinned.
This keeps pin semantics consistent — a pinned model stays loaded
unless explicitly unloaded by the user via the admin API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jundot force-pushed the main branch 5 times, most recently from 6670575 to 6041883 on April 14, 2026 14:17