Releases: nullata/llamaMan

1.2.1

01 Jun 23:37

[1.2.1] - 2026-06-02

Added

  • Multi-node clustering: several LlamaMan deployments can now run as one logical cluster. Clustering is opt-in (CLUSTER_ENABLED plus a shared CLUSTER_SECRET) and entirely inert for single-node installs. Nodes discover each other through the database storage backend's shared node registry (register_node / list_nodes) rather than pairwise configuration, so any node added anywhere becomes visible to all; every node-to-node call carries the secret in an X-Cluster-Secret header (never as a client bearer token), and each node advertises how peers reach it via CLUSTER_ADVERTISE_URL. The dashboard gains per-node System and GPU monitoring cards and a Cluster settings tab showing each node's identity, advertise URL, heartbeat age (stamped on the database's clock so node-to-node skew can't flap a healthy node offline), and an HTTP-reachability badge. A control relay (/api/cluster/nodes/<id>/proxy/...) lets the UI drive launches, image pulls, and downloads on any node, with node selectors added to the launch, images, and downloads forms. New api/cluster.py, core/cluster.py, static/js/cluster.js, and an extensive tests/test_cluster.py.
  • Shared inference queue and cross-node least-load dispatch: with Share queue with same model enabled, instances of a model across nodes form a group, and an inference request is routed to the group node with the fewest in-flight requests; a queued request can migrate to a freer peer, with the hop chain bounded by MAX_HOPS and guarded against loops. An optional Queue group name pools same-family / different-quant instances under one alias (which also becomes the llama-server --alias the instance advertises), and a Fallback only flag marks instances that should serve only when every non-fallback member of the group is at capacity or unreachable. The cluster heartbeat runs on its own dedicated thread (CLUSTER_HEARTBEAT_INTERVAL_S) so forwarded inference on the shared worker can't starve it and flap nodes offline.
  • Request logging page (/logging, linked from the dashboard header): a dedicated page that rolls up recorded inference traffic into summary tiles (request count with errors, average and peak throughput, average TTFT and latency, prompt/completion/total tokens), a time-window selector (24h / 7d / 30d / all), a recent-conversations list, and a per-conversation drill-down showing each turn's prompt/response and metrics. New templates/logging.html and static/js/logging.js.
  • Per-node settings: settings that must differ per host - the node's Docker images and the model-cap eviction toggles - are now scoped under settings["nodes"][<node_id>] instead of being shared cluster-wide, while reads transparently fall back to the legacy top-level value so existing single-node installs upgrade with zero migration. New core/node_settings.py.
  • Cluster load-test scripts: scripts/cluster-loadtest.sh and scripts/cluster-loadtest-hi.sh for exercising cross-node dispatch under load.
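
The least-load routing and Fallback only rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not LlamaMan's implementation: the Member dataclass, its field names, and pick_member are hypothetical, and "at capacity" is assumed to mean in_flight >= max_concurrent (with 0 meaning no cap).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Member:
    """Hypothetical view of one group instance (not LlamaMan's real data model)."""
    node_id: str
    in_flight: int
    max_concurrent: int  # assumption: 0 means "no cap"
    reachable: bool = True
    fallback_only: bool = False

    def has_capacity(self) -> bool:
        uncapped = self.max_concurrent == 0
        return self.reachable and (uncapped or self.in_flight < self.max_concurrent)


def pick_member(members: Sequence[Member]) -> Optional[Member]:
    """Pick the member with the fewest in-flight requests.

    Fallback-only members are considered only when no regular member is
    reachable with spare capacity. Returns None when nothing can serve,
    in which case a caller would presumably queue the request.
    """
    regular = [m for m in members if not m.fallback_only and m.has_capacity()]
    pool = regular or [m for m in members if m.fallback_only and m.has_capacity()]
    return min(pool, key=lambda m: m.in_flight) if pool else None
```

Hop limiting (MAX_HOPS) and loop protection are left out of this sketch; they would sit around the forwarding step rather than the selection step.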

Changed

  • LLAMAMAN_NODE_NAME is now required for every install, not just clusters: it is the node's stable identity, its per-node settings namespace, and the cluster registry key, so a single-node deployment can later join a cluster without orphaning its state. The app refuses to start (with guidance) when it is unset.
  • Default MTP draft count lowered from 3 to 2: the Speculative Decoding Draft N Max field now defaults to 2 when left blank.
  • Settings UI polish and quality-of-life: verbose inline descriptions were converted to hover info icons (the Fallback-only flag, the admin-UI eviction toggles, and the new cluster monitoring toggle), the Docker Images tab gained a Manage Docker images section heading, the Settings card moved to the top of the dashboard, and assorted spacing and layout were tightened. A new Hide long-offline nodes from resource monitors cluster toggle drops a node that has been silent for over 10 minutes from the System and GPU cards only - it stays listed under Cluster nodes and remains routable.
  • Docs: the README and Docker Hub overview were updated to cover clustering, the request-logging page, and the new CLUSTER_* / LLAMAMAN_NODE_NAME environment variables; the stale 1.0.0 screenshot was removed.

Fixed

  • Cross-node balancing could silently break when a peer was reachable in the database but not over HTTP (e.g. a WSL node advertising a host IP with no port-forward): such a node looks "online" by heartbeat yet can't actually be dispatched to. The Cluster tab now actively probes the dispatch path and flags those nodes as unreachable, so the broken balancing is visible instead of surfacing as stray 504/500s.
  • Auto-restart-on-crash moved off the monitoring tick: opt-in crash recovery now runs on a dedicated, loop-guarded daemon thread, so a restarting instance can't stall the poller - and, by extension, the cluster heartbeat. New tests/test_auto_restart.py.
  • Logging page couldn't scroll: the page is a flex child of a height:100vh / overflow:hidden body and had no inner scroll container, so a tall conversations list was clipped with no way to scroll. .logging-page is now its own scroll region (flex:1; min-height:0; overflow-y:auto), mirroring the dashboard's main column.

1.1.8

21 May 16:23

Changelog

All notable changes to LlamaMan are documented here.

[1.1.8] - 2026-05-21

Added

  • Per-instance request stats: each running instance card gains a Stats button (chart icon) that opens a modal summarizing that instance's recorded traffic - request count (with errors), average and peak throughput (tokens/s), average time-to-first-token, average latency, prompt/completion/total tokens, and the active time span. The numbers are rolled up from the request log via the new GET /api/request-log/stats endpoint (optional inst_id and window_hours query params; also returns the current recording mode), so the modal shows an empty state prompting you to enable recording when it's off, and stats persist even after the instance is stopped. New request_log_stats() storage method is implemented on both backends (the JSON backend scans on-disk records; MariaDB aggregates with a single GROUP BY). New static/js/stats.js, #stats-modal markup, and .stat-* / .stats-* CSS.
  • Accurate per-turn throughput in the request log: RecordingHandle now records generation-only tokens_per_sec and ttft_ms for each turn. Paths that already measure real generation timing set them explicitly (set_metrics); SSE relay paths mark the first token (mark_first_token) and finalize() derives the metrics over the generation window (excluding prompt evaluation), giving truer numbers than re-deriving from total duration. MariaDB gains tokens_per_sec / ttft_ms columns via schema migration 002 (a no-op for the schema-less JSON backend).
  • Bare-metal deployment support: LlamaMan can now run directly on the host instead of only as a container (e.g. under WSL). The mode is auto-detected (/.dockerenv, /run/.containerenv, then cgroup inspection) and it reaches spawned llama-server containers accordingly - by container name on the Docker network when containerized, or via localhost on their published ports when bare-metal. A new resolve_llama_endpoint() helper centralizes this and is applied at launch, relaunch, orphan adoption, and state restore. New env vars LLAMAMAN_IN_DOCKER (force the mode) and LLAMA_HOST_ADDR (host address for the published ports, default localhost).
  • Speculative decoding (MTP) toggle (#59): a new Speculative Decoding section in the launch form runs the model with MTP speculative decoding (--spec-type draft-mtp), with an optional Draft N Max field mapping to --spec-draft-n-max (blank = llama.cpp's default of 3). Intended for models with MTP heads (e.g. Qwen3.6); other speculative-decoding types can still be configured through Extra Args. Two new config fields, spec_enabled and spec_draft_n_max, are saved in presets and plumbed through launch, restart, and proxy auto-start (_ensure_model_running); build_llama_cmd emits the flags. New f-spec-enabled / f-spec-draft-n-max controls with an updateSpecState() enable/disable handler.
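
To make the generation-only throughput idea from the request-log entry above concrete, here is a small sketch. The function name, its parameters, and the choice to return None when no first token was seen are assumptions; the source only says that metrics are derived over the generation window, excluding prompt evaluation.

```python
from typing import Optional, Tuple


def derive_metrics(
    started_at: float,
    first_token_at: Optional[float],
    finished_at: float,
    completion_tokens: int,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_ms, tokens_per_sec) for one turn.

    All timestamps are seconds on the same clock. Throughput is measured
    from the first token to the end of the response, so time spent on
    prompt evaluation does not dilute it.
    """
    if first_token_at is None:
        return None, None  # assumption: no first token means no metrics
    ttft_ms = (first_token_at - started_at) * 1000.0
    window = finished_at - first_token_at
    if window <= 0 or completion_tokens <= 0:
        return ttft_ms, None
    return ttft_ms, completion_tokens / window
```

For a turn that starts at t=10 s, emits its first token at t=12 s, and finishes 100 tokens at t=16 s, this gives a TTFT of 2000 ms and 25 tokens/s, whereas dividing by the total 6 s duration would report about 16.7 tokens/s.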

Changed

  • Launch form refactor: the launch settings form is reorganized into clearer grouped sections - the new Speculative Decoding block sits alongside Proxy-Side Sampling Overrides, and Extra Args plus the Share queue / Embedding Model toggles are moved to the bottom of the form.
  • README: removed the "What's New" section (folding its still-relevant items into Features) and documented request recording & stats, bare-metal deployment, the new environment variables, and the CPU-only / GPU Devices behavior.

Fixed

  • GPU Layers = 0 now launches CPU-only: a container previously always received a GPU device request based on the detected vendor, ignoring the GPU Layers value, so 0 still triggered Docker's CDI GPU discovery and failed on hosts without GPU passthrough (e.g. WSL without the NVIDIA Container Toolkit) with "failed to discover GPU vendor from CDI". With GPU Layers 0, no GPU device is attached at all, so the model runs fully on CPU.
  • Instance card GPU label accuracy: the card now labels exactly the GPUs an instance actually uses - none for CPU-only (GPU Layers 0) launches, and the specific indices chosen in GPU Devices otherwise. The literal all typed into the GPU Devices field (matching its placeholder) is normalized to "all GPUs" everywhere - device attachment, ROCm ROCR_VISIBLE_DEVICES, and the label - instead of being treated as a device id.
  • Proxy sampling overrides now apply on the OpenAI-compatible endpoint: /v1/chat/completions (llamaman_v1_chat) was not calling apply_proxy_sampling_overrides, so configured temperature / top-k / top-p / presence / repeat penalties were enforced on the Ollama routes but silently ignored for OpenAI-style requests. The overrides are now applied to the request body before forwarding, matching the other inference paths.
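
The GPU Layers / GPU Devices behavior described in the first two fixes can be summarized in a short sketch. The function name and return shape are hypothetical, and it assumes that the field is comma-separated, that a blank field means all GPUs, and that "all" is matched case-insensitively; none of those details are confirmed by the notes above.

```python
from typing import List, Tuple


def resolve_gpu_selection(gpu_layers: int, gpu_devices: str) -> Tuple[str, List[str]]:
    """Return ("none", []), ("all", []), or ("ids", [...]).

    - GPU Layers 0: CPU-only, so no GPU device is attached at all.
    - Blank or the literal "all": every GPU (assumed normalization).
    - Otherwise: the explicit device indices that were entered.
    """
    if gpu_layers == 0:
        return "none", []
    cleaned = gpu_devices.strip().lower()
    if cleaned in ("", "all"):
        return "all", []
    ids = [part.strip() for part in cleaned.split(",") if part.strip()]
    return "ids", ids
```

Deriving device attachment, ROCR_VISIBLE_DEVICES, and the card label from one such result is one way to keep the three from disagreeing.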

1.1.5

12 May 20:22
725506e

[1.1.5] - 2026-05-12

Added

  • Download progress bars in the Logs modal: the per-download "Logs" button (now labelled "Progress") opens a modal that leads with an overall progress bar plus one bar per file - for multipart GGUFs (...-00001-of-00008.gguf), which we expand and download in full, every shard gets its own bar showing bytes / total / percent / live speed. Bars turn green when a part is done, grey and frozen on paused/failed/cancelled, and show an indeterminate stripe for an in-progress file whose size isn't yet known; the surrounding error is surfaced inline on failure. Auto-scroll is hidden in this mode (nothing to scroll), and a new "Raw log" toggle in the header reveals the original text log underneath. New .dlp-* CSS, #log-progress container, and #btn-toggle-raw-log control.
  • Structured download progress file: core/downloader.py now writes an atomic, throttled (~2/s) JSON snapshot to dl-<id>.progress.json next to the log - {repo_id, filename, status, error, started_at, updated_at, parts:[{name, index, total, downloaded, size, speed, status}]}. Parts are pre-populated from the HuggingFace file listing (with sizes) before downloading starts; download_file updates its part as it streams, including the resume / "already complete" (HTTP 416) path. Plain-text log output is unchanged. New HF_PROGRESS_FILE env var plumbed through api/downloads.py.
  • GET /api/downloads/<id>/progress: returns the JSON snapshot above merged with the live download_status (downloading / paused / completed / failed / cancelled) so the UI has the authoritative state even if the on-disk file is stale or the subprocess was killed mid-write.
  • Orphan stopped-container cleanup: new opt-in toggle "Remove orphaned stopped containers" under "Auto-clean stopped instances" in Settings → Cleanup (cleanup.orphan_containers_enabled, default off). On the hourly cleanup pass it lists all llamaman-labelled containers (not just running ones) and removes any in exited/dead/created state that have no matching instance record; running/paused ones are left for the existing orphan-adoption scan. Stamps cleanup.orphan_containers_last_run_at. list_llama_containers() gained an all=False parameter to enable the stopped-container sweep.
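
An atomic, throttled JSON snapshot like the progress file described above is commonly written with a temp file plus os.replace. The sketch below shows that general pattern only; the ProgressWriter class, its method names, and the 0.5 s interval (roughly two writes per second) are assumptions and not taken from core/downloader.py.

```python
import json
import os
import tempfile
import time


class ProgressWriter:
    """Hypothetical helper: throttled, atomic JSON snapshots."""

    def __init__(self, path, min_interval_s=0.5):
        self.path = path
        self.min_interval_s = min_interval_s
        self._last_write = None  # monotonic time of the last successful write

    def write(self, snapshot, force=False):
        """Write the snapshot unless throttled; return True if written."""
        now = time.monotonic()
        if (
            not force
            and self._last_write is not None
            and now - self._last_write < self.min_interval_s
        ):
            return False
        directory = os.path.dirname(os.path.abspath(self.path))
        # Write to a temp file in the same directory, then rename over the
        # target, so a reader never observes a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(snapshot, fh)
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
        self._last_write = now
        return True
```

Passing force=True on status changes (completed, failed, paused) would keep the final state from being dropped by the throttle.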

Changed

  • Downloads panel button: the per-download "Logs" action is now "Progress" (chart icon) - same modal, progress-first.
  • Main view no longer grows a horizontal scrollbar when the settings card is expanded: the cause was the info-tip tooltip pseudo-elements (position: absolute, visibility: hidden, up to 240px wide) on right-column launch-form fields, which still count toward scrollable overflow and poked past the card's right edge. .main now has min-width: 0 and overflow-x: hidden (the clip-right/clip-left hover logic keeps visible tooltips on-screen); .form-grid uses minmax(min(200px, 100%), 1fr) and the max-content toggle-grid tracks (.cleanup-settings-grid, .cleanup-toggle-grid, .launch-toggle-table) are now minmax(0, max-content) so content wraps under pressure instead of forcing overflow.

1.1.2

08 May 18:07

[1.1.2] - 2026-05-08

Added

  • context_length on /api/ps: each loaded model entry now carries a context_length field set to the runtime ctx the instance was launched with (inst["config"]["ctx_size"], baked into --ctx-size at container start), matching real Ollama's /api/ps shape. Falls back to the GGUF <arch>.context_length only if the live config is missing. Lets clients like Hermes see the effective cap (e.g. a preset-imposed 64K) instead of inferring the trained max from the GGUF header (e.g. 256K).
  • GGUF full-metadata reader with shared cache (api/models.py): new get_gguf_full_metadata flat-dict reader plus get_cached_gguf_metadata keyed on (path, mtime, size). Huge tokenizer arrays (tokens, merges, scores, token_type) are skipped via a new _skip_gguf_value walker so trailing scalars (bos_token_id, eos_token_id, tokenizer.chat_template) remain reachable. The existing layer-fit reader get_gguf_metadata is refactored to read from the same cache, so /api/model-layers, /api/ps, and /api/show share a single parse per file.
  • format_param_count / estimate_model_vram helpers (api/models.py): render parameter counts as "8.0B"/"200M" and approximate weight-only VRAM from n_gpu_layers and GGUF block_count (full size when -1, zero when 0, proportional otherwise).
  • GPU temperature on /api/gpu-info: each GPU entry now includes a temperature_c field (integer °C, or null when the source doesn't expose it). Wired through every query path: pynvml.nvmlDeviceGetTemperature for NVIDIA, a new _read_hwmon_temp_c helper that walks /sys/class/drm/card*/device/hwmon/hwmon*/temp*_input and prefers the edge sensor for AMD/Intel sysfs, plus temperature.gpu on the nvidia-smi container-exec fallback and --showtemp parsing on the rocm-smi fallback.
  • GPU temperature display in the system info card: each GPU row's label column now stacks GPU N over the live temperature reading, aligned with the core/VRAM bars. Color thresholds match the existing bar palette: ≥85 °C red, ≥75 °C yellow, otherwise muted; shows - when temperature is unavailable. New .gpu-bar-label-col / .gpu-bar-temp CSS classes.
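
The parameter-count formatting and weight-only VRAM estimate mentioned above might look roughly like this. These are illustrative stand-ins with hypothetical names and signatures, not the real format_param_count / estimate_model_vram; the rounding rules, the treatment of any negative layer count as "all layers", and the clamp at block_count are assumptions.

```python
def format_params(count: int) -> str:
    """Render a parameter count such as 8_030_000_000 as "8.0B"."""
    if count >= 1_000_000_000:
        return f"{count / 1e9:.1f}B"
    if count >= 1_000_000:
        return f"{count / 1e6:.0f}M"
    return str(count)


def estimate_weight_vram(file_size: int, n_gpu_layers: int, block_count: int) -> int:
    """Approximate weight-only VRAM in bytes.

    Zero when no layers are offloaded, the full file size when all layers
    are offloaded, and a proportional share of the file size otherwise.
    KV cache and scratch buffers are deliberately not included.
    """
    if n_gpu_layers == 0:
        return 0
    if n_gpu_layers < 0 or block_count <= 0 or n_gpu_layers >= block_count:
        return file_size
    return file_size * n_gpu_layers // block_count
```

For example, offloading 16 of 32 blocks of an 8 GB file yields an estimate of 4 GB.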

Changed

  • /api/ps details.family / families now derive from GGUF general.architecture (e.g. llama, qwen2) instead of splitting the filename on -. Falls back to the old heuristic for non-GGUF models or unparseable headers.
  • /api/ps details.parameter_size is now populated from GGUF general.size_label (e.g. "8B"), or falls back to format_param_count(general.parameter_count). Previously always "".
  • /api/ps expires_at now reflects the configured idle reaper deadline: _last_request_at + idle_timeout_min*60 when the timeout is set, and a far-future sentinel (2100-01-01) when idle_timeout_min == 0. Previously hardcoded to started_at + 300s regardless of the actual reaper config.
  • /api/ps size_vram now reports an estimated weight-only VRAM footprint via estimate_model_vram instead of always returning 0. Approximate (does not include KV cache, scratch buffers, or partial-spill fallbacks), but matches the spirit of Ollama's per-model field.
  • /api/show model_info is now populated from the full GGUF header (general.*, <arch>.*, scalar tokenizer fields) instead of three stubbed keys. The <arch>.context_length value is overridden with the effective ctx in this priority: running instance config > saved preset > GGUF default - so callers reading model_info see the cap that would actually apply, not the trained max.
  • /api/show template now exposes tokenizer.chat_template from the GGUF header when present (was always "").
  • /api/tags details uses the same GGUF-sourced family/parameter_size enrichment as /api/ps, via a shared _details_from_gguf helper.
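
The expires_at rule from the list above reduces to a few lines. This sketch assumes the last-request timestamp is stored as epoch seconds (suggested by the idle_timeout_min*60 arithmetic); the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Sentinel used when the idle reaper is disabled (value taken from the notes above).
FAR_FUTURE = datetime(2100, 1, 1, tzinfo=timezone.utc)


def compute_expires_at(last_request_at: float, idle_timeout_min: int) -> datetime:
    """Return when the idle reaper would stop the instance.

    last_request_at is assumed to be epoch seconds. A timeout of 0 means
    the instance is never reaped, so the far-future sentinel is returned.
    """
    if idle_timeout_min == 0:
        return FAR_FUTURE
    last = datetime.fromtimestamp(last_request_at, tz=timezone.utc)
    return last + timedelta(minutes=idle_timeout_min)
```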

1.1.1

05 May 21:29

[1.1.1] - 2026-05-06

Added

  • Versioned schema migrations (#51): new core/migrations.py runs pending migrations at app startup before any code reads timestamp-affected tables, gated by a backend-provided advisory lock (server-side lock on MariaDB, lockfile on JSON) so multi-worker gunicorn setups serialize cleanly. CURRENT_SCHEMA_VERSION lives in code; the applied version is persisted in settings and re-read inside the lock to skip already-applied migrations. Failures abort startup rather than serving traffic against a half-migrated schema. (49dd044)
  • Centralized timestamp helpers (core/timeutil.py): now_utc, now_iso, to_iso, parse_iso, epoch_ms_to_iso, epoch_s_to_iso. Wire/storage format standardized to ISO 8601 UTC with millisecond precision and a trailing Z (e.g. 2026-05-05T14:32:18.123Z); SQL columns are DATETIME(3), JSON records and API payloads are strings. (49dd044)
  • Migration 001 - timestamp normalization (#51): converts legacy epoch-int created_at values in request_log and api_keys to native datetime / ISO strings on both the JSON and MariaDB backends. (49dd044)
  • Bundled Font Awesome 7.1.0 (#1): icon assets now ship inside the repo at static/fontawesome-free-7.1.0-web/ and are served via Flask's url_for('static', ...), removing the runtime dependency on cdnjs.cloudflare.com. (4eb3a22)
  • Third-party license attribution (#1): README gained a Third-party licenses subsection calling out the bundled Font Awesome (CC BY 4.0 icons, SIL OFL 1.1 fonts, MIT code) and linking to the in-repo LICENSE.txt that ships with it. (a58bcfc)
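
The timestamp format standardized above (ISO 8601 UTC, millisecond precision, trailing Z) can be produced and parsed with the standard library alone. The helper names below mirror three of those listed for core/timeutil.py, but the bodies are a sketch, not the project's code; treating naive datetimes as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone


def to_iso(dt: datetime) -> str:
    """Format a datetime as e.g. 2026-05-05T14:32:18.123Z."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    dt = dt.astimezone(timezone.utc)
    millis = dt.microsecond // 1000  # truncate to millisecond precision
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def parse_iso(value: str) -> datetime:
    """Parse the format produced by to_iso into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def epoch_ms_to_iso(ms: int) -> str:
    """Convert a legacy epoch-milliseconds integer to the ISO wire format."""
    # Integer arithmetic avoids float rounding on the millisecond part.
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return to_iso(base + timedelta(milliseconds=ms % 1000))
```

A migration such as 001 would, in this sketch, call epoch_ms_to_iso (or a seconds-based equivalent) on each legacy created_at value.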

Changed

  • prune_request_log interface: storage method now accepts a datetime (or ISO string) cutoff instead of older_than_ms: int. The monitoring poller computes datetime.now(timezone.utc) - timedelta(days=N) directly rather than converting to epoch milliseconds. (49dd044)
  • API key created_at is now ISO 8601 UTC (string) rather than an epoch int. UI date parsing in static/js/system.js for both API keys and HuggingFace tokens transparently handles both number and string forms during the migration window. (49dd044)
  • Instance card meta line now displays a truncated container ID (first 12 chars, or - when absent) instead of PID ${inst.pid}, which removed the long-standing PID undefined artifact in the live instance view. (#52) (49dd044)
  • FontAwesome stylesheet source in templates/index.html and templates/login.html swapped from the cdnjs CDN (font-awesome/6.7.2/css/all.min.css) to the locally bundled 7.1.0 stylesheet. This is a major-version bump (6.7.2 → 7.1.0): existing fa-* class names continue to resolve, but any downstream code relying on glyphs renamed or removed between FA majors should be re-checked visually. (4eb3a22)
  • Download / logging settings input width widened from 80px to 138px (.download-settings-input) so the request-logging mode selector and adjacent inputs render their full option text without clipping. (4eb3a22)
  • Version bumped to 1.1.1. (49dd044)

Fixed

  • PID undefined in instance view (#52): the live instances panel no longer surfaces a literal PID undefined segment for instances that lack a pid field. The meta line now resolves to the short container ID (or a dash) instead. (49dd044)
  • Logging mode select sizing in Settings: the input/select previously clipped longer option labels at 80px wide; widened to 138px so the request-logging mode dropdown reads cleanly. (4eb3a22)
  • Offline / air-gapped deployments: removing the cdnjs link means the UI no longer loses its icon set when the host has no outbound internet access (or when cdnjs is blocked). (#1) (4eb3a22)

1.0.0

02 May 22:11
3721ee7

[1.0.0] - 2026-05-02

Added

  • Request logging (#48): optional recording of inference traffic (Ollama and OpenAI APIs plus the per-instance proxy) with off / per_request / per_conversation modes, content-hash conversation grouping, configurable retention, and a new /api/request-log/conversations endpoint backed by both the JSON and MariaDB storage backends. (7895043, b91a611)
  • HuggingFace nested-file and multipart GGUF resolution: download flow now fetches the repo file list and expands a single shard pick (e.g. model-00001-of-00008.gguf) to all shards, and resolves UI-supplied basenames to their nested repo paths before spawning the downloader. (55be8f3)
  • RECORDINGS_DIR env var (#48): lets operators relocate per-conversation request log records on the JSON backend, with documentation of the new request_log/ data directory and request_log MariaDB table. (7895043, b91a611)
  • Launch Instance setting tooltips (#44): each input label in the Launch tab now has a circle-info icon pinned to the right edge of its row that reveals a short description of the setting on hover or focus. Tooltips center on the icon by default and flip to right- or left-anchored placement when the centered position would extend past the viewport. (37ef2d8)
  • Per-instance CPU and RAM usage bars: each running instance card now shows a thin progress bar next to each resource value, reusing the system / GPU-VRAM bar style at a smaller scale (5px tall) with the same >90% red / >70% yellow / else green color thresholds. The RAM bar is only rendered when a Memory Limit is configured (no denominator otherwise). (e00fc69, b146781)
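
The multipart GGUF expansion described above (one shard pick expanded to all shards) can be sketched against a repo file listing. The function name is hypothetical, and the five-digit -NNNNN-of-NNNNN pattern is inferred from the example filename rather than from the downloader's code.

```python
import re
from typing import List

# Matches e.g. "dir/model-00001-of-00008.gguf" (pattern inferred from the example).
_SHARD_RE = re.compile(r"^(?P<prefix>.+)-\d{5}-of-(?P<total>\d{5})\.gguf$")


def expand_shards(picked: str, repo_files: List[str]) -> List[str]:
    """Expand a single shard pick to every shard of the same model.

    Non-sharded picks are returned unchanged. Siblings must share the
    same path prefix and the same "of-NNNNN" total.
    """
    match = _SHARD_RE.match(picked)
    if not match:
        return [picked]
    sibling = re.compile(
        re.escape(match.group("prefix"))
        + r"-\d{5}-of-"
        + match.group("total")
        + r"\.gguf$"
    )
    return sorted(f for f in repo_files if sibling.match(f))
```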

Changed

  • Preset timeout and queue fields (idle_timeout_min, max_concurrent, max_queue_depth, share_queue) now apply to already-running instances on save without requiring a relaunch - the reaper re-reads idle timeout each tick and the request gate is refreshed in place. (#47) (77a8007)
  • Live preset updates now also propagate the six proxy-sampling fields (override_enabled, temperature, top_k, top_p, presence_penalty, repeat_penalty) to running instances. The per-instance proxy and the Ollama/OpenAI compat routes both read these from inst["config"] per request, so the next request after a preset save reflects the new values. Caveat: if the instance was launched with idle_timeout = 0, max_concurrent = 0, and override_enabled = false, no sidecar proxy was spawned, so direct hits to the public port still bypass the override; compat routes still apply it, and a relaunch is required to spawn the proxy retroactively. (6173d7f)
  • CPU usage in the instances panel is now displayed as a percentage of the instance's CPU allotment instead of a raw percentage across all host cores, so a fully-loaded 2-core container reads as 100% / 2 cores rather than 200%. (afeb734)
  • Inactive-instance relaunch now fully reconciles the request gate against the merged config, covering create / refresh / remove / no-op transitions instead of only refreshing an existing gate. (7895043)
  • Subprocess-facing settings are now mirrored to a backend-agnostic snapshot file written by the main process, so download subprocesses pick up live global speed-limit changes within one second on both JSON and MariaDB backends. (#49) (7895043)
  • Version bumped to 0.9.7. (c9f3714)
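
The CPU-allotment change above is a simple normalization. A minimal sketch, with a hypothetical function name and the assumption that an instance without a configured allotment keeps the raw figure:

```python
def cpu_percent_of_allotment(raw_percent: float, allotted_cores: float) -> float:
    """Scale a host-wide CPU percentage to the instance's CPU allotment.

    raw_percent follows the usual Docker-stats convention where one fully
    busy core is 100, so two busy cores read as 200.
    """
    if allotted_cores <= 0:
        return raw_percent  # assumption: no allotment, show the raw value
    return raw_percent / allotted_cores
```

With a 2-core allotment, a raw reading of 200 becomes 100, matching the example in the entry above.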

Fixed

  • Global download speed limit now takes effect on running downloads when using the MariaDB backend - previously the downloader read settings.json directly, which only existed on the JSON backend. (#49) (7895043)

0.9.6

12 Apr 22:24
1a494b4

Docker-in-Docker architecture #37

LlamaMan no longer bundles or calls llama.cpp directly. Instead it spawns each model server as a sibling Docker container using the official ghcr.io/ggml-org/llama.cpp:server-* images via the Docker socket. This is the foundational change that everything else in this release builds on.

  • LlamaMan is now a lightweight Python-only container with no GPU dependency of its own
  • llama-server containers are created, started, stopped, and removed through the Docker SDK
  • GPU passthrough, port binding, volume mounts, CPU quota, and memory limits are applied per-container at launch time
  • Models volume is passed to sub-containers using a MODELS_HOST_DIR env var that resolves the actual host-side path for the bind mount
  • Backing containers are always cleaned up: stop_container now catches errors from stop() and calls remove() regardless, so already-exited containers don't leave orphaned records
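
The cleanup guarantee in the last bullet (remove() runs even when stop() fails) follows a familiar pattern. The sketch below takes any object with Docker-SDK-style stop() and remove() methods; the function name, the timeout value, and the use of force=True are assumptions.

```python
import logging

log = logging.getLogger(__name__)


def stop_and_remove(container, timeout: int = 10) -> None:
    """Stop a container if possible, then always attempt to remove it.

    An already-exited or missing container makes stop() raise; that error
    is logged and swallowed so the removal step still runs.
    """
    try:
        container.stop(timeout=timeout)
    except Exception as exc:  # broad on purpose: removal must still happen
        log.warning("stop failed for %s: %s", getattr(container, "name", "?"), exc)
    container.remove(force=True)
```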

Universal GPU support - single image for all vendors

  • Single Dockerfile - Dockerfile.cuda and Dockerfile.rocm are replaced by one Dockerfile. One image tag covers NVIDIA, AMD (ROCm), Intel Arc, and CPU-only
  • Auto-detection at startup - LlamaMan probes the host: pynvml for NVIDIA, /sys/class/drm sysfs for AMD and Intel Arc. Detected vendor is logged at startup
  • LLAMA_IMAGE auto-default - if the env var is not set, the image is selected from the detected vendor (server-cuda / server-rocm / server-sycl / server)
  • GPU_TYPE override - set to cuda, rocm, or intel to skip auto-detection
  • Intel Arc support - new intel branch in _run_container: mounts /dev/dri, adds video/render groups, uses server-sycl image by default. Per-instance GPU device selection is not supported on Intel Arc
  • Single docker-compose.yml - the separate ROCm profile service is removed; /sys/class/drm:ro mount is included by default; NVIDIA toolkit utility capability block is present as a commented-out section
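
The image-selection precedence described in this section (explicit LLAMA_IMAGE, then GPU_TYPE override, then the detected vendor) can be sketched as below. The function name is hypothetical, and it assumes the detection step reports vendors using the same labels as GPU_TYPE (cuda, rocm, intel).

```python
import os
from typing import Mapping, Optional

IMAGE_REPO = "ghcr.io/ggml-org/llama.cpp"

# Vendor label -> image tag, per the tags listed above; anything else is CPU-only.
_TAG_BY_VENDOR = {"cuda": "server-cuda", "rocm": "server-rocm", "intel": "server-sycl"}


def select_image(detected_vendor: Optional[str], env: Mapping[str, str] = os.environ) -> str:
    """Choose the llama-server image.

    Precedence: LLAMA_IMAGE (used verbatim), then GPU_TYPE, then the
    vendor found by auto-detection, then the plain CPU image.
    """
    explicit = env.get("LLAMA_IMAGE")
    if explicit:
        return explicit
    vendor = env.get("GPU_TYPE") or detected_vendor
    return f"{IMAGE_REPO}:{_TAG_BY_VENDOR.get(vendor, 'server')}"
```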

Native GPU monitoring

  • VRAM and utilization are now queried inside the llamaman container directly - no running llama-server instance required
  • NVIDIA: uses pynvml. Requires uncommenting the deploy.resources.reservations block in docker-compose.yml to grant toolkit utility capability
  • AMD / Intel Arc: reads mem_info_vram_used, mem_info_vram_total, gpu_busy_percent, and product_name from /sys/class/drm sysfs (the :ro mount in the compose file)
  • Falls back to the previous exec-based nvidia-smi / rocm-smi approach when native access is not configured and a container is running
  • The GPU panel no longer returns an error when no llama-server containers are running

Container resource monitoring

  • Each running instance card shows live stats updated every 3 seconds: CPU%, core quota, RAM used / limit, and GPU assignment
  • CPU quota is read from the instance's configured threads value (the Docker nano_cpus setting), not from online_cpus which always reflects the host CPU count
  • GPU assignment is resolved from the instance config against the detected GPU list - no container inspection needed
  • Stats are fetched in parallel via a ThreadPoolExecutor to avoid blocking the UI on slow Docker API calls

Per-container resource limits

  • CPU Threads now applies both --threads N to llama-server and a Docker CPU quota (nano_cpus) to the container, capping the cores it can use. Leave blank for no limit
  • Memory Limit - new field in the launch form (e.g. 32g, 8192m). Sets mem_limit on the spawned container. Saved in presets. Leave blank for no limit

Docker image management

  • Pull image by name - new text input in the Docker Images tab lets you pull any image by name directly (e.g. ghcr.io/ggml-org/llama.cpp:server-cuda) without it needing to be in the tracked list first
  • Delete local image - each image in the list now has a delete button that removes it from Docker and from the tracked list. Disabled for the active LLAMA_IMAGE. Returns an error if Docker refuses (e.g. image in use by a running container)

Model backup and restore #39

  • Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file
  • Restore from JSON - upload a previously exported backup. For each model in the file:
    • Already present on disk: preset is merged in (existing values are not overwritten)
    • Not present but has a HuggingFace source: download is queued immediately and preset is pre-populated at the expected post-download path so it is ready when the file lands
    • Not present and no known source: reported as unrestorable
  • Results are shown inline with per-model status badges (present / queued / missing / error)

Repeat penalty in proxy sampling overrides

  • New Repeat Penalty field in the per-instance proxy sampling overrides section
  • Default 0 (disabled - not injected into requests). Range 0-2.0
  • Only injected into proxied requests when set above 0, so leaving it at the default has no effect on clients that set their own value

0.8.9-4

08 Apr 14:24
1ee7f9a

  • Display source repository info on model cards (#40)
    Added UI support for showing the HuggingFace repo_id a model was downloaded from. CSS, JS, and template changes only - no backend changes

  • Fix per-instance proxy blocking on request body read
    _extract_model_from_request was calling wsgi.input.read() with no argument, which reads the raw socket until EOF (blocking until the client disconnects). Fixed by reading exactly CONTENT_LENGTH bytes, so the proxy no longer hangs waiting for the connection to close before forwarding

  • Model name validation for healthy instances on per-instance proxy ports
    Previously, model name validation (returning 404 for a mismatched "model" field) only ran for sleeping/stopped instances. Extended the check to run after all wake/wait logic so healthy and starting instances are validated consistently - sending the wrong model name to a port always returns 404 regardless of instance state

  • Docs and version bump (0.8.9-4)
    README and DOCKERHUB.md updates covering per-instance proxy behavior and model validation rules, a MariaDB/MySQL setup snippet with CREATE DATABASE/CREATE USER/GRANT commands, and a minor docker-compose correction
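
The request-body fix above comes down to bounding the read by CONTENT_LENGTH. A minimal WSGI sketch of that idea follows; the function name is hypothetical, and treating a missing or malformed header as an empty body is an assumption.

```python
def read_request_body(environ) -> bytes:
    """Read exactly CONTENT_LENGTH bytes from a WSGI request.

    Calling wsgi.input.read() with no size can block until the client
    closes the connection, so the read is always bounded here.
    """
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0  # assumption: malformed header is treated as no body
    if length <= 0:
        return b""
    return environ["wsgi.input"].read(length)
```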

0.8.9

07 Apr 15:34
0236b5e

Model Favorites & Notes (#35)

  • Star/favorite models in the sidebar model library - click the star icon to mark favorites, which sort alphabetically at the top of the list
  • Favorite toggle in settings - a star button appears in the Launch Instance tab bar (far right) for quick access
  • Model notes - a new "Note" text field in the Launch Instance form lets you add a note to any model, saved automatically on blur
  • Favorites and notes are stored as part of model presets and persist across sessions
  • Added PATCH /api/presets/<path> endpoint for lightweight partial preset updates (favorite/note only, no full preset required)

Proxy Wake-on-Request by Model Name (#36)

  • Fixed: when sending an OpenAI API request directly to a sleeping instance's port (e.g. POST http://localhost:8000/v1/chat/completions), the idle proxy now inspects the model field in the request body and wakes the sleeping instance if the model matches
  • If the requested model doesn't match the sleeping instance, the proxy returns a clear 404 error instead of a generic failure
  • If the original instance record is gone but a sleeping instance with a matching model exists on that port, the proxy finds and wakes it
  • Non-inference requests (health checks, etc.) continue to wake the instance unconditionally
  • The main llamaman proxy on port 42069 is unaffected - all changes are scoped to the per-instance idle proxy (ports 8000-8020)

0.8.7

01 Apr 15:40

  • Add embeddings endpoint (#32)
  • Add standard API embeddings guard (#32)
  • Automatically add the --embeddings server option when the UI's Embedding Model toggle is enabled