Releases: nullata/llamaMan
1.2.1
[1.2.1] - 2026-06-02
Added
- Multi-node clustering: several LlamaMan deployments can now run as one logical cluster. Clustering is opt-in (CLUSTER_ENABLED plus a shared CLUSTER_SECRET) and entirely inert for single-node installs. Nodes discover each other through the database storage backend's shared node registry (register_node / list_nodes) rather than pairwise configuration, so any node added anywhere becomes visible to all; every node-to-node call carries the secret in an X-Cluster-Secret header (never as a client bearer token), and each node advertises how peers reach it via CLUSTER_ADVERTISE_URL. The dashboard gains per-node System and GPU monitoring cards and a Cluster settings tab showing each node's identity, advertise URL, heartbeat age (stamped on the database's clock so node-to-node skew can't flap a healthy node offline), and an HTTP-reachability badge. A control relay (/api/cluster/nodes/<id>/proxy/...) lets the UI drive launches, image pulls, and downloads on any node, with node selectors added to the launch, images, and downloads forms. New api/cluster.py, core/cluster.py, static/js/cluster.js, and an extensive tests/test_cluster.py.
- Shared inference queue and cross-node least-load dispatch: with Share queue with same model enabled, instances of a model across nodes form a group, and an inference request is routed to the group node with the fewest in-flight requests; a queued request can migrate to a freer peer, with the hop chain bounded by MAX_HOPS and guarded against loops. An optional Queue group name pools same-family / different-quant instances under one alias (which also becomes the llama-server --alias the instance advertises), and a Fallback only flag marks instances that should serve only when every non-fallback member of the group is at capacity or unreachable. The cluster heartbeat runs on its own dedicated thread (CLUSTER_HEARTBEAT_INTERVAL_S) so forwarded inference on the shared worker can't starve it and flap nodes offline.
- Request logging page (/logging, linked from the dashboard header): a dedicated page that rolls up recorded inference traffic into summary tiles (request count with errors, average and peak throughput, average TTFT and latency, prompt/completion/total tokens), a time-window selector (24h / 7d / 30d / all), a recent-conversations list, and a per-conversation drill-down showing each turn's prompt/response and metrics. New templates/logging.html and static/js/logging.js.
- Per-node settings: settings that must differ per host - the node's Docker images and the model-cap eviction toggles - are now scoped under settings["nodes"][<node_id>] instead of being shared cluster-wide, while reads transparently fall back to the legacy top-level value so existing single-node installs upgrade with zero migration. New core/node_settings.py.
- Cluster load-test scripts: scripts/cluster-loadtest.sh and scripts/cluster-loadtest-hi.sh for exercising cross-node dispatch under load.
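The least-load dispatch rule described above (fewest in-flight requests, fallback-only members used last, hop chain bounded) can be sketched as follows. This is an illustrative sketch, not the code in core/cluster.py: the Member fields, the pick_member name, and the MAX_HOPS value of 3 are assumptions, and skipping already-visited nodes is just one possible loop guard.

```python
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 3  # hypothetical value; the release notes only name the setting


@dataclass
class Member:
    node_id: str
    in_flight: int          # requests currently being served
    capacity: int           # assumed per-instance concurrency limit
    reachable: bool = True
    fallback_only: bool = False


def pick_member(members: list[Member], hops: list[str]) -> Optional[Member]:
    """Return the least-loaded eligible member, or None if nobody can take it."""
    if len(hops) >= MAX_HOPS:
        return None  # hop chain exhausted; the caller keeps the request local

    def usable(m: Member) -> bool:
        # Skipping nodes already in the hop chain is the assumed loop guard.
        return m.reachable and m.in_flight < m.capacity and m.node_id not in hops

    primary = [m for m in members if usable(m) and not m.fallback_only]
    # Fallback-only members are considered only when no primary can serve.
    pool = primary or [m for m in members if usable(m) and m.fallback_only]
    return min(pool, key=lambda m: m.in_flight, default=None)
```

With members a (2 in flight), b (1 in flight) and a fallback-only c (idle), the sketch picks b; c is chosen only once a and b are full or unreachable.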
Changed
- LLAMAMAN_NODE_NAME is now required for every install, not just clusters: it is the node's stable identity, its per-node settings namespace, and the cluster registry key, so a single-node deployment can later join a cluster without orphaning its state. The app refuses to start (with guidance) when it is unset.
- Default MTP draft count lowered from 3 to 2: the Speculative Decoding Draft N Max field now defaults to 2 when left blank.
- Settings UI polish and quality-of-life: verbose inline descriptions were converted to hover info icons (the Fallback-only flag, the admin-UI eviction toggles, and the new cluster monitoring toggle), the Docker Images tab gained a Manage Docker images section heading, the Settings card moved to the top of the dashboard, and assorted spacing and layout were tightened. A new Hide long-offline nodes from resource monitors cluster toggle drops a node that has been silent for over 10 minutes from the System and GPU cards only - it stays listed under Cluster nodes and remains routable.
- Docs: the README and Docker Hub overview were updated to cover clustering, the request-logging page, and the new CLUSTER_* / LLAMAMAN_NODE_NAME environment variables; the stale 1.0.0 screenshot was removed.
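Because LLAMAMAN_NODE_NAME doubles as the per-node settings namespace, a read with the legacy top-level fallback might look roughly like the sketch below. The get_node_setting name and the docker_images key are hypothetical; only the settings["nodes"][<node_id>] layout comes from the notes.

```python
from typing import Any


def get_node_setting(settings: dict, node_id: str, key: str, default: Any = None) -> Any:
    """Read a per-node value, falling back to the legacy top-level key."""
    node_scope = settings.get("nodes", {}).get(node_id, {})
    if key in node_scope:
        return node_scope[key]
    # Pre-cluster installs stored the value at the top level.
    return settings.get(key, default)


settings = {
    "docker_images": ["legacy-image"],  # hypothetical legacy top-level value
    "nodes": {"gpu-box": {"docker_images": ["cuda-image"]}},
}
print(get_node_setting(settings, "gpu-box", "docker_images"))  # per-node value
print(get_node_setting(settings, "laptop", "docker_images"))   # legacy fallback
```

A fallback on read, rather than a one-off migration, is what lets an upgraded single-node install keep working without rewriting its settings.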
Fixed
- Cross-node balancing could silently break when a peer was reachable in the database but not over HTTP (e.g. a WSL node advertising a host IP with no port-forward): such a node looks "online" by heartbeat yet can't actually be dispatched to. The Cluster tab now actively probes the dispatch path and flags those nodes as unreachable, so the broken balancing is visible instead of surfacing as stray 504/500s.
- Auto-restart-on-crash moved off the monitoring tick: opt-in crash recovery now runs on a dedicated, loop-guarded daemon thread, so a restarting instance can't stall the poller - and, by extension, the cluster heartbeat. New tests/test_auto_restart.py.
- Logging page couldn't scroll: the page is a flex child of a height:100vh / overflow:hidden body and had no inner scroll container, so a tall conversations list was clipped with no way to scroll. .logging-page is now its own scroll region (flex:1; min-height:0; overflow-y:auto), mirroring the dashboard's main column.
1.1.8
Changelog
All notable changes to LlamaMan are documented here.
[1.1.8] - 2026-05-21
Added
- Per-instance request stats: each running instance card gains a Stats button (chart icon) that opens a modal summarizing that instance's recorded traffic - request count (with errors), average and peak throughput (tokens/s), average time-to-first-token, average latency, prompt/completion/total tokens, and the active time span. The numbers are rolled up from the request log via the new GET /api/request-log/stats endpoint (optional inst_id and window_hours query params; also returns the current recording mode), so the modal shows an empty state prompting you to enable recording when it's off, and stats persist even after the instance is stopped. New request_log_stats() storage method is implemented on both backends (the JSON backend scans on-disk records; MariaDB aggregates with a single GROUP BY). New static/js/stats.js, #stats-modal markup, and .stat-* / .stats-* CSS.
- Accurate per-turn throughput in the request log: RecordingHandle now records generation-only tokens_per_sec and ttft_ms for each turn. Paths that already measure real generation timing set them explicitly (set_metrics); SSE relay paths mark the first token (mark_first_token) and finalize() derives the metrics over the generation window (excluding prompt evaluation), giving truer numbers than re-deriving from total duration. MariaDB gains tokens_per_sec / ttft_ms columns via schema migration 002 (a no-op for the schema-less JSON backend).
- Bare-metal deployment support: LlamaMan can now run directly on the host instead of only as a container (e.g. under WSL). The mode is auto-detected (/.dockerenv, /run/.containerenv, then cgroup inspection) and it reaches spawned llama-server containers accordingly - by container name on the Docker network when containerized, or via localhost on their published ports when bare-metal. A new resolve_llama_endpoint() helper centralizes this and is applied at launch, relaunch, orphan adoption, and state restore. New env vars LLAMAMAN_IN_DOCKER (force the mode) and LLAMA_HOST_ADDR (host address for the published ports, default localhost).
- Speculative decoding (MTP) toggle (#59): a new Speculative Decoding section in the launch form runs the model with MTP speculative decoding (--spec-type draft-mtp), with an optional Draft N Max field mapping to --spec-draft-n-max (blank = llama.cpp's default of 3). Intended for models with MTP heads (e.g. Qwen3.6); other speculative-decoding types can still be configured through Extra Args. Two new config fields, spec_enabled and spec_draft_n_max, are saved in presets and plumbed through launch, restart, and proxy auto-start (_ensure_model_running); build_llama_cmd emits the flags. New f-spec-enabled / f-spec-draft-n-max controls with an updateSpecState() enable/disable handler.
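To illustrate why measuring over the generation window gives truer per-turn throughput than dividing by total duration, here is a minimal sketch. The generation_metrics function and its timestamp arguments are assumptions for illustration, not RecordingHandle's actual interface.

```python
from typing import Optional


def generation_metrics(start_s: float, first_token_s: Optional[float],
                       end_s: float, completion_tokens: int) -> dict:
    """Derive TTFT and generation-only throughput from three timestamps."""
    if first_token_s is None:
        # No token was ever relayed, so neither metric is defined.
        return {"ttft_ms": None, "tokens_per_sec": None}
    ttft_ms = (first_token_s - start_s) * 1000.0
    window = end_s - first_token_s  # excludes prompt evaluation time
    tps = completion_tokens / window if window > 0 else None
    return {"ttft_ms": ttft_ms, "tokens_per_sec": tps}


m = generation_metrics(start_s=0.0, first_token_s=0.5, end_s=2.5, completion_tokens=100)
print(m)  # generation-only rate: 100 tokens over 2.0 s
print(100 / (2.5 - 0.0))  # the total-duration rate is lower because it includes the prompt phase
```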
Changed
- Launch form refactor: the launch settings form is reorganized into clearer grouped sections - the new Speculative Decoding block sits alongside Proxy-Side Sampling Overrides, and Extra Args plus the Share queue / Embedding Model toggles are moved to the bottom of the form.
- README: removed the "What's New" section (folding its still-relevant items into Features) and documented request recording & stats, bare-metal deployment, the new environment variables, and the CPU-only / GPU Devices behavior.
Fixed
- GPU Layers = 0 now launches CPU-only: a container previously always received a GPU device request based on the detected vendor, ignoring the GPU Layers value, so 0 still triggered Docker's CDI GPU discovery and failed on hosts without GPU passthrough (e.g. WSL without the NVIDIA Container Toolkit) with "failed to discover GPU vendor from CDI". With GPU Layers 0, no GPU device is attached at all, so the model runs fully on CPU.
- Instance card GPU label accuracy: the card now labels exactly the GPUs an instance actually uses - none for CPU-only (GPU Layers 0) launches, and the specific indices chosen in GPU Devices otherwise. The literal all typed into the GPU Devices field (matching its placeholder) is normalized to "all GPUs" everywhere - device attachment, ROCm ROCR_VISIBLE_DEVICES, and the label - instead of being treated as a device id.
- Proxy sampling overrides now apply on the OpenAI-compatible endpoint: /v1/chat/completions (llamaman_v1_chat) was not calling apply_proxy_sampling_overrides, so configured temperature / top-k / top-p / presence / repeat penalties were enforced on the Ollama routes but silently ignored for OpenAI-style requests. The overrides are now applied to the request body before forwarding, matching the other inference paths.
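A rough sketch of what applying the overrides to a request body before forwarding could look like is shown below. It is not the project's apply_proxy_sampling_overrides: the function name is hypothetical, and the merge semantics are assumed (the configured value replaces the client's, and a repeat penalty of 0 is treated as disabled, as the 0.9.6 notes describe).

```python
SAMPLING_KEYS = ("temperature", "top_k", "top_p", "presence_penalty", "repeat_penalty")


def apply_overrides_sketch(body: dict, config: dict) -> dict:
    """Return a copy of the request body with configured sampling values enforced."""
    out = dict(body)
    if not config.get("override_enabled"):
        return out
    for key in SAMPLING_KEYS:
        value = config.get(key)
        if value is None:
            continue  # field left blank in the preset
        if key == "repeat_penalty" and value <= 0:
            continue  # 0 means disabled, so the client's own value survives
        out[key] = value
    return out


config = {"override_enabled": True, "temperature": 0.2, "repeat_penalty": 0}
request = {"model": "demo", "temperature": 1.0, "repeat_penalty": 1.1}
print(apply_overrides_sketch(request, config))
```

Running the same helper on every inference path (Ollama routes, OpenAI-style routes, and the per-instance proxy) is what keeps their behavior consistent.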
1.1.5
[1.1.5] - 2026-05-12
Added
- Download progress bars in the Logs modal: the per-download "Logs" button (now labelled "Progress") opens a modal that leads with an overall progress bar plus one bar per file - for multipart GGUFs (...-00001-of-00008.gguf), which we expand and download in full, every shard gets its own bar showing bytes / total / percent / live speed. Bars turn green when a part is done, grey and frozen on paused / failed / cancelled, and show an indeterminate stripe for an in-progress file whose size isn't yet known; the surrounding error is surfaced inline on failure. Auto-scroll is hidden in this mode (nothing to scroll), and a new "Raw log" toggle in the header reveals the original text log underneath. New .dlp-* CSS, #log-progress container, and #btn-toggle-raw-log control.
- Structured download progress file: core/downloader.py now writes an atomic, throttled (~2/s) JSON snapshot to dl-<id>.progress.json next to the log - {repo_id, filename, status, error, started_at, updated_at, parts:[{name, index, total, downloaded, size, speed, status}]}. Parts are pre-populated from the HuggingFace file listing (with sizes) before downloading starts; download_file updates its part as it streams, including the resume / "already complete" (HTTP 416) path. Plain-text log output is unchanged. New HF_PROGRESS_FILE env var plumbed through api/downloads.py.
- GET /api/downloads/<id>/progress: returns the JSON snapshot above merged with the live download_status (downloading / paused / completed / failed / cancelled) so the UI has the authoritative state even if the on-disk file is stale or the subprocess was killed mid-write.
- Orphan stopped-container cleanup: new opt-in toggle "Remove orphaned stopped containers" under "Auto-clean stopped instances" in Settings → Cleanup (cleanup.orphan_containers_enabled, default off). On the hourly cleanup pass it lists all llamaman-labelled containers (not just running ones) and removes any in exited / dead / created state that have no matching instance record; running/paused ones are left for the existing orphan-adoption scan. Stamps cleanup.orphan_containers_last_run_at. list_llama_containers() gained an all=False parameter to enable the stopped-container sweep.
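An atomic, throttled snapshot write of the kind the progress file relies on is commonly done by writing a temporary file and renaming it over the target. The sketch below shows that pattern; the ProgressWriter class is hypothetical and not taken from core/downloader.py.

```python
import json
import os
import tempfile
import time
from typing import Optional


class ProgressWriter:
    """Write a JSON snapshot atomically, at most max_per_sec times a second."""

    def __init__(self, path: str, max_per_sec: float = 2.0):
        self.path = path
        self.min_interval = 1.0 / max_per_sec
        self._last_write: Optional[float] = None

    def write(self, snapshot: dict, force: bool = False) -> bool:
        now = time.monotonic()
        if (not force and self._last_write is not None
                and now - self._last_write < self.min_interval):
            return False  # throttled; a later update will carry newer numbers
        directory = os.path.dirname(os.path.abspath(self.path))
        # The temp file must live on the same filesystem for the rename to be atomic.
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(snapshot, fh)
        os.replace(tmp, self.path)  # readers see the old or the new file, never a partial one
        self._last_write = now
        return True
```

A final status change (completed, failed, cancelled) would typically be written with force=True so the throttle never swallows it.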
Changed
- Downloads panel button: the per-download "Logs" action is now "Progress" (chart icon) - same modal, progress-first.
- Main view no longer grows a horizontal scrollbar when the settings card is expanded: the cause was the info-tip tooltip pseudo-elements (position: absolute, visibility: hidden, up to 240px wide) on right-column launch-form fields, which still count toward scrollable overflow and poked past the card's right edge. .main now has min-width: 0 and overflow-x: hidden (the clip-right / clip-left hover logic keeps visible tooltips on-screen); .form-grid uses minmax(min(200px, 100%), 1fr) and the max-content toggle-grid tracks (.cleanup-settings-grid, .cleanup-toggle-grid, .launch-toggle-table) are now minmax(0, max-content) so content wraps under pressure instead of forcing overflow.
1.1.2
[1.1.2] - 2026-05-08
Added
- context_length on /api/ps: each loaded model entry now carries a context_length field set to the runtime ctx the instance was launched with (inst["config"]["ctx_size"], baked into --ctx-size at container start), matching real Ollama's /api/ps shape. Falls back to the GGUF <arch>.context_length only if the live config is missing. Lets clients like Hermes see the effective cap (e.g. a preset-imposed 64K) instead of inferring the trained max from the GGUF header (e.g. 256K).
- GGUF full-metadata reader with shared cache (api/models.py): new get_gguf_full_metadata flat-dict reader plus get_cached_gguf_metadata keyed on (path, mtime, size). Huge tokenizer arrays (tokens, merges, scores, token_type) are skipped via a new _skip_gguf_value walker so trailing scalars (bos_token_id, eos_token_id, tokenizer.chat_template) remain reachable. The existing layer-fit reader get_gguf_metadata is refactored to read from the same cache, so /api/model-layers, /api/ps, and /api/show share a single parse per file.
- format_param_count / estimate_model_vram helpers (api/models.py): render parameter counts as "8.0B" / "200M" and approximate weight-only VRAM from n_gpu_layers and GGUF block_count (full size when -1, zero when 0, proportional otherwise).
- GPU temperature on /api/gpu-info: each GPU entry now includes a temperature_c field (integer °C, or null when the source doesn't expose it). Wired through every query path: pynvml.nvmlDeviceGetTemperature for NVIDIA, a new _read_hwmon_temp_c helper that walks /sys/class/drm/card*/device/hwmon/hwmon*/temp*_input and prefers the edge sensor for AMD/Intel sysfs, plus temperature.gpu on the nvidia-smi container-exec fallback and --showtemp parsing on the rocm-smi fallback.
- GPU temperature display in the system info card: each GPU row's label column now stacks GPU N over the live temperature reading, aligned with the core/VRAM bars. Color thresholds match the existing bar palette: ≥85 °C red, ≥75 °C yellow, otherwise muted; shows - when temperature is unavailable. New .gpu-bar-label-col / .gpu-bar-temp CSS classes.
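The two helper behaviors described above (parameter-count rendering and the weight-only VRAM estimate) can be approximated as in this sketch. The function names carry a _sketch suffix because they are illustrations, not the code in api/models.py; using the model file size as the "full size" and the exact rounding are assumptions.

```python
def format_param_count_sketch(count: int) -> str:
    """Render a parameter count in the '8.0B' / '200M' style."""
    if count >= 1_000_000_000:
        return f"{count / 1e9:.1f}B"
    if count >= 1_000_000:
        return f"{count / 1e6:.0f}M"
    return str(count)


def estimate_vram_sketch(file_size: int, n_gpu_layers: int, block_count: int) -> int:
    """Approximate weight-only VRAM in bytes from the offloaded layer share."""
    if n_gpu_layers == 0:
        return 0  # CPU-only launch
    if n_gpu_layers < 0 or block_count <= 0 or n_gpu_layers >= block_count:
        return file_size  # -1 (or more layers than the model has) means fully offloaded
    return file_size * n_gpu_layers // block_count


print(format_param_count_sketch(8_030_000_000))
print(estimate_vram_sketch(file_size=4_000_000_000, n_gpu_layers=16, block_count=32))
```

As the changelog notes for size_vram, an estimate like this ignores KV cache and scratch buffers, so it is a lower bound on real usage.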
Changed
- /api/ps details.family / families now derive from GGUF general.architecture (e.g. llama, qwen2) instead of splitting the filename on -. Falls back to the old heuristic for non-GGUF models or unparseable headers.
- /api/ps details.parameter_size is now populated from GGUF general.size_label (e.g. "8B"), or falls back to format_param_count(general.parameter_count). Previously always "".
- /api/ps expires_at now reflects the configured idle reaper deadline: _last_request_at + idle_timeout_min*60 when the timeout is set, and a far-future sentinel (2100-01-01) when idle_timeout_min == 0. Previously hardcoded to started_at + 300s regardless of the actual reaper config.
- /api/ps size_vram now reports an estimated weight-only VRAM footprint via estimate_model_vram instead of always returning 0. Approximate (does not include KV cache, scratch buffers, or partial-spill fallbacks), but matches the spirit of Ollama's per-model field.
- /api/show model_info is now populated from the full GGUF header (general.*, <arch>.*, scalar tokenizer fields) instead of three stubbed keys. The <arch>.context_length value is overridden with the effective ctx in this priority: running instance config > saved preset > GGUF default - so callers reading model_info see the cap that would actually apply, not the trained max.
- /api/show template now exposes tokenizer.chat_template from the GGUF header when present (was always "").
- /api/tags details uses the same GGUF-sourced family/parameter_size enrichment as /api/ps, via a shared _details_from_gguf helper.
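The expires_at rule for /api/ps described above reduces to a small calculation. The sketch below assumes timezone-aware datetimes and uses hypothetical names; only the formula and the 2100-01-01 sentinel come from the notes.

```python
from datetime import datetime, timedelta, timezone

FAR_FUTURE = datetime(2100, 1, 1, tzinfo=timezone.utc)  # sentinel for "never reaped"


def expires_at_sketch(last_request_at: datetime, idle_timeout_min: int) -> datetime:
    """Deadline at which the idle reaper would stop the instance."""
    if idle_timeout_min == 0:
        return FAR_FUTURE
    return last_request_at + timedelta(seconds=idle_timeout_min * 60)


last = datetime(2026, 5, 8, 12, 0, 0, tzinfo=timezone.utc)
print(expires_at_sketch(last, 15).isoformat())
print(expires_at_sketch(last, 0).isoformat())
```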
1.1.1
[1.1.1] - 2026-05-06
Added
- Versioned schema migrations (#51): new core/migrations.py runs pending migrations at app startup before any code reads timestamp-affected tables, gated by a backend-provided advisory lock (server-side lock on MariaDB, lockfile on JSON) so multi-worker gunicorn setups serialize cleanly. CURRENT_SCHEMA_VERSION lives in code; the applied version is persisted in settings and re-read inside the lock to skip already-applied migrations. Failures abort startup rather than serving traffic against a half-migrated schema. (49dd044)
- Centralized timestamp helpers (core/timeutil.py): now_utc, now_iso, to_iso, parse_iso, epoch_ms_to_iso, epoch_s_to_iso. Wire/storage format standardized to ISO 8601 UTC with millisecond precision and a trailing Z (e.g. 2026-05-05T14:32:18.123Z); SQL columns are DATETIME(3), JSON records and API payloads are strings. (49dd044)
- Migration 001 - timestamp normalization (#51): converts legacy epoch-int created_at values in request_log and api_keys to native datetime / ISO strings on both the JSON and MariaDB backends. (49dd044)
- Bundled Font Awesome 7.1.0 (#1): icon assets now ship inside the repo at static/fontawesome-free-7.1.0-web/ and are served via Flask's url_for('static', ...), removing the runtime dependency on cdnjs.cloudflare.com. (4eb3a22)
- Third-party license attribution (#1): README gained a Third-party licenses subsection calling out the bundled Font Awesome (CC BY 4.0 icons, SIL OFL 1.1 fonts, MIT code) and linking to the in-repo LICENSE.txt that ships with it. (a58bcfc)
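The timestamp wire format described above (ISO 8601 UTC, millisecond precision, trailing Z) can be produced and parsed with the standard library alone. The following is a sketch of helpers in that spirit; the _sketch names are deliberate, since the real core/timeutil.py implementations are not shown in these notes.

```python
from datetime import datetime, timedelta, timezone


def to_iso_sketch(dt: datetime) -> str:
    """Format an aware datetime as e.g. 2026-05-05T14:32:18.123Z."""
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def parse_iso_sketch(text: str) -> datetime:
    """Parse the format above back into an aware UTC datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def epoch_ms_to_iso_sketch(ms: int) -> str:
    """Convert a legacy epoch-milliseconds integer, as migration 001 must."""
    seconds, millis = divmod(ms, 1000)  # integer math avoids float rounding
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return to_iso_sketch(base + timedelta(milliseconds=millis))


stamp = to_iso_sketch(datetime(2026, 5, 5, 14, 32, 18, 123000, tzinfo=timezone.utc))
print(stamp)
print(parse_iso_sketch(stamp))
```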
Changed
- prune_request_log interface: storage method now accepts a datetime (or ISO string) cutoff instead of older_than_ms: int. The monitoring poller computes datetime.now(timezone.utc) - timedelta(days=N) directly rather than converting to epoch milliseconds. (49dd044)
- API key created_at is now ISO 8601 UTC (string) rather than an epoch int. UI date parsing in static/js/system.js for both API keys and HuggingFace tokens transparently handles both number and string forms during the migration window. (49dd044)
- Instance card meta line now displays a truncated container ID (first 12 chars, or - when absent) instead of PID ${inst.pid}, which removed the long-standing PID undefined artifact in the live instance view. (#52) (49dd044)
- FontAwesome stylesheet source in templates/index.html and templates/login.html swapped from the cdnjs CDN (font-awesome/6.7.2/css/all.min.css) to the locally bundled 7.1.0 stylesheet. The major-version bump (6.7.2 to 7.1.0) means existing fa-* class names continue to resolve, but any downstream code relying on glyphs renamed/removed between FA majors should be re-checked visually. (4eb3a22)
- Download / logging settings input width widened from 80px to 138px (.download-settings-input) so the request-logging mode selector and adjacent inputs render their full option text without clipping. (4eb3a22)
- Version bumped to 1.1.1. (49dd044)
Fixed
- PID undefined in instance view (#52): the live instances panel no longer surfaces a literal PID undefined segment for instances that lack a pid field. The meta line now resolves to the short container ID (or a dash) instead. (49dd044)
- Logging mode select sizing in Settings: the input/select previously clipped longer option labels at 80px wide; widened to 138px so the request-logging mode dropdown reads cleanly. (4eb3a22)
- Offline / air-gapped deployments: removing the cdnjs link means the UI no longer loses its icon set when the host has no outbound internet access (or when cdnjs is blocked). (#1) (4eb3a22)
1.0.0
[1.0.0] - 2026-05-02
Added
- Request logging (#48): optional recording of inference traffic (Ollama and OpenAI APIs plus the per-instance proxy) with off / per_request / per_conversation modes, content-hash conversation grouping, configurable retention, and a new /api/request-log/conversations endpoint backed by both the JSON and MariaDB storage backends. (7895043, b91a611)
- HuggingFace nested-file and multipart GGUF resolution: download flow now fetches the repo file list and expands a single shard pick (e.g. model-00001-of-00008.gguf) to all shards, and resolves UI-supplied basenames to their nested repo paths before spawning the downloader. (55be8f3)
- RECORDINGS_DIR env var (#48): lets operators relocate per-conversation request log records on the JSON backend, with documentation of the new request_log/ data directory and request_log MariaDB table. (7895043, b91a611)
- Launch Instance setting tooltips (#44): each input label in the Launch tab now has a circle-info icon pinned to the right edge of its row that reveals a short description of the setting on hover or focus. Tooltips center on the icon by default and flip to right- or left-anchored placement when the centered position would extend past the viewport. (37ef2d8)
- Per-instance CPU and RAM usage bars: each running instance card now shows a thin progress bar next to each resource value, reusing the system / GPU-VRAM bar style at a smaller scale (5px tall) with the same >90% red / >70% yellow / else green color thresholds. The RAM bar is only rendered when a Memory Limit is configured (no denominator otherwise). (e00fc69, b146781)
Changed
- Preset timeout and queue fields (idle_timeout_min, max_concurrent, max_queue_depth, share_queue) now apply to already-running instances on save without requiring a relaunch - the reaper re-reads idle timeout each tick and the request gate is refreshed in place. (#47) (77a8007)
- Live preset updates now also propagate the six proxy-sampling fields (override_enabled, temperature, top_k, top_p, presence_penalty, repeat_penalty) to running instances. The per-instance proxy and the Ollama/OpenAI compat routes both read these from inst["config"] per request, so the next request after a preset save reflects the new values. Caveat: if the instance was launched with idle_timeout = 0, max_concurrent = 0, and override_enabled = false, no sidecar proxy was spawned, so direct hits to the public port still bypass the override; compat routes still apply it, and a relaunch is required to spawn the proxy retroactively. (6173d7f)
- CPU usage in the instances panel is now displayed as a percentage of the instance's CPU allotment instead of a raw percentage across all host cores, so a fully-loaded 2-core container reads as 100% / 2 cores rather than 200%. (afeb734)
100% / 2 coresrather than200%. (afeb734) - Inactive-instance relaunch now fully reconciles the request gate against the merged config, covering create / refresh / remove / no-op transitions instead of only refreshing an existing gate. (7895043)
- Subprocess-facing settings are now mirrored to a backend-agnostic snapshot file written by the main process, so download subprocesses pick up live global speed-limit changes within one second on both JSON and MariaDB backends. (#49) (7895043)
- Version bumped to 0.9.7. (c9f3714)
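The CPU display change and the bar color thresholds in this release are simple arithmetic; a minimal sketch, with hypothetical function names, is shown here.

```python
def cpu_percent_of_allotment(raw_percent: float, cores: float) -> float:
    """Scale a host-wide CPU percentage to the instance's core allotment."""
    if cores <= 0:
        return raw_percent  # no quota configured; nothing to scale against
    return raw_percent / cores


def bar_color(percent: float) -> str:
    """Thresholds quoted in the notes: >90% red, >70% yellow, else green."""
    if percent > 90:
        return "red"
    if percent > 70:
        return "yellow"
    return "green"


def cpu_label(raw_percent: float, cores: float) -> str:
    """Build the 'N% / M cores' text shown on an instance card."""
    return f"{cpu_percent_of_allotment(raw_percent, cores):.0f}% / {cores:g} cores"


print(cpu_label(200.0, 2))  # a fully loaded 2-core container
print(bar_color(cpu_percent_of_allotment(200.0, 2)))
```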
Fixed
0.9.6
Docker-in-Docker architecture #37
LlamaMan no longer bundles or calls llama.cpp directly. Instead it spawns each model server as a sibling Docker container using the official ghcr.io/ggml-org/llama.cpp:server-* images via the Docker socket. This is the foundational change that everything else in this release builds on.
- LlamaMan is now a lightweight Python-only container with no GPU dependency of its own
- llama-server containers are created, started, stopped, and removed through the Docker SDK
- GPU passthrough, port binding, volume mounts, CPU quota, and memory limits are applied per-container at launch time
- Models volume is passed to sub-containers using a MODELS_HOST_DIR env var that resolves the actual host-side path for the bind mount
- Backing containers are always cleaned up: stop_container now catches errors from stop() and calls remove() regardless, so already-exited containers don't leave orphaned records
Universal GPU support - single image for all vendors
- Single Dockerfile - Dockerfile.cuda and Dockerfile.rocm are replaced by one Dockerfile. One image tag covers NVIDIA, AMD (ROCm), Intel Arc, and CPU-only
- Auto-detection at startup - LlamaMan probes the host: pynvml for NVIDIA, /sys/class/drm sysfs for AMD and Intel Arc. Detected vendor is logged at startup
- LLAMA_IMAGE auto-default - if the env var is not set, the image is selected from the detected vendor (server-cuda / server-rocm / server-sycl / server)
- GPU_TYPE override - set to cuda, rocm, or intel to skip auto-detection
- Intel Arc support - new intel branch in _run_container: mounts /dev/dri, adds video / render groups, uses server-sycl image by default. Per-instance GPU device selection is not supported on Intel Arc
- Single docker-compose.yml - the separate ROCm profile service is removed; /sys/class/drm:ro mount is included by default; NVIDIA toolkit utility capability block is present as a commented-out section
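The image-selection precedence described above (explicit LLAMA_IMAGE, then GPU_TYPE, then the auto-detected vendor) can be sketched as below. The resolve_image name and the vendor strings passed in by the detector are assumptions; the repository and tag names are the ones quoted in these notes.

```python
import os
from typing import Mapping, Optional

IMAGE_REPO = "ghcr.io/ggml-org/llama.cpp"
TAG_BY_VENDOR = {"cuda": "server-cuda", "rocm": "server-rocm", "intel": "server-sycl"}


def resolve_image(detected_vendor: Optional[str],
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the llama-server image: LLAMA_IMAGE, else GPU_TYPE, else detection."""
    env = os.environ if env is None else env
    if env.get("LLAMA_IMAGE"):
        return env["LLAMA_IMAGE"]  # an explicit image always wins
    vendor = env.get("GPU_TYPE") or detected_vendor
    # Unknown or missing vendor falls back to the CPU-only "server" tag.
    return f"{IMAGE_REPO}:{TAG_BY_VENDOR.get(vendor, 'server')}"


print(resolve_image("cuda", env={}))
print(resolve_image(None, env={"GPU_TYPE": "intel"}))
print(resolve_image(None, env={}))
```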
Native GPU monitoring
- VRAM and utilization are now queried inside the llamaman container directly - no running llama-server instance required
- NVIDIA: uses pynvml. Requires uncommenting the deploy.resources.reservations block in docker-compose.yml to grant toolkit utility capability
- AMD / Intel Arc: reads mem_info_vram_used, mem_info_vram_total, gpu_busy_percent, and product_name from /sys/class/drm sysfs (the :ro mount in the compose file)
- Falls back to the previous exec-based nvidia-smi / rocm-smi approach when native access is not configured and a container is running
Container resource monitoring
- Each running instance card shows live stats updated every 3 seconds: CPU%, core quota, RAM used / limit, and GPU assignment
- CPU quota is read from the instance's configured threads value (the Docker nano_cpus setting), not from online_cpus which always reflects the host CPU count
- GPU assignment is resolved from the instance config against the detected GPU list - no container inspection needed
- Stats are fetched in parallel via a ThreadPoolExecutor to avoid blocking the UI on slow Docker API calls
Per-container resource limits
- CPU Threads now applies both --threads N to llama-server and a Docker CPU quota (nano_cpus) to the container, capping the cores it can use. Leave blank for no limit
- Memory Limit - new field in the launch form (e.g. 32g, 8192m). Sets mem_limit on the spawned container. Saved in presets. Leave blank for no limit
Docker image management
- Pull image by name - new text input in the Docker Images tab lets you pull any image by name directly (e.g. ghcr.io/ggml-org/llama.cpp:server-cuda) without it needing to be in the tracked list first
- Delete local image - each image in the list now has a delete button that removes it from Docker and from the tracked list. Disabled for the active LLAMA_IMAGE. Returns an error if Docker refuses (e.g. image in use by a running container)
Model backup and restore #39
- Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file
- Restore from JSON - upload a previously exported backup. For each model in the file:
- Already present on disk: preset is merged in (existing values are not overwritten)
- Not present but has a HuggingFace source: download is queued immediately and preset is pre-populated at the expected post-download path so it is ready when the file lands
- Not present and no known source: reported as unrestorable
- Results are shown inline with per-model status badges (present / queued / missing / error)
Repeat penalty in proxy sampling overrides
- New Repeat Penalty field in the per-instance proxy sampling overrides section
- Default 0 (disabled - not injected into requests). Range 0–2.0
- Only injected into proxied requests when set above 0, so leaving it at the default has no effect on clients that set their own value
0.8.9-4
- Display source repository info on model cards (#40)
  Added UI support for showing the HuggingFace repo_id a model was downloaded from. CSS, JS, and template changes only - no backend changes
- Fix per-instance proxy blocking on request body read
  _extract_model_from_request was calling wsgi.input.read() with no argument, which reads the raw socket until EOF (blocking until the client disconnects). Fixed by reading exactly CONTENT_LENGTH bytes, so the proxy no longer hangs waiting for the connection to close before forwarding
- Model name validation for healthy instances on per-instance proxy ports
  Previously, model name validation (returning 404 for a mismatched "model" field) only ran for sleeping/stopped instances. Extended the check to run after all wake/wait logic so healthy and starting instances are validated consistently - sending the wrong model name to a port always returns 404 regardless of instance state
- Docs and version bump (0.8.9-4)
  README and DOCKERHUB.md updates covering: per-instance proxy behavior and model validation rules, MariaDB/MySQL setup snippet with CREATE DATABASE/CREATE USER/GRANT commands, and minor docker-compose correction
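The request-body fix above follows the usual WSGI rule of reading only CONTENT_LENGTH bytes from wsgi.input. A minimal sketch of that pattern follows; read_request_model is a hypothetical stand-in, not the project's _extract_model_from_request.

```python
import io
import json
from typing import Optional, Tuple


def read_request_model(environ: dict) -> Tuple[Optional[str], bytes]:
    """Read exactly CONTENT_LENGTH bytes and pull out the JSON 'model' field."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    if length <= 0:
        return None, b""  # no body declared, so never touch the socket
    raw = environ["wsgi.input"].read(length)  # bounded read; cannot block until EOF
    try:
        model = json.loads(raw).get("model")
    except (ValueError, AttributeError):
        model = None  # not JSON, or not a JSON object
    return model, raw


body = b'{"model": "demo", "messages": []}'
environ = {"CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body + b"extra")}
print(read_request_model(environ)[0])
```

The raw bytes are returned alongside the model name so the proxy can forward the body unchanged after inspecting it.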
0.8.9
Model Favorites & Notes (#35)
- Star/favorite models in the sidebar model library - click the star icon to mark favorites, which sort alphabetically at the top of the list
- Favorite toggle in settings - a star button appears in the Launch Instance tab bar (far right) for quick access
- Model notes - a new "Note" text field in the Launch Instance form lets you add a note to any model, saved automatically on blur
- Favorites and notes are stored as part of model presets and persist across sessions
- Added PATCH /api/presets/<path> endpoint for lightweight partial preset updates (favorite/note only, no full preset required)
Proxy Wake-on-Request by Model Name (#36)
- Fixed: when sending an OpenAI API request directly to a sleeping instance's port (e.g. POST http://localhost:8000/v1/chat/completions), the idle proxy now inspects the model field in the request body and wakes the sleeping instance if the model matches
- If the requested model doesn't match the sleeping instance, the proxy returns a clear 404 error instead of a generic failure
- If the original instance record is gone but a sleeping instance with a matching model exists on that port, the proxy finds and wakes it
- Non-inference requests (health checks, etc.) continue to wake the instance unconditionally
- The main llamaman proxy on port 42069 is unaffected - all changes are scoped to the per-instance idle proxy (ports 8000-8020)