Multi-node clustering: several LlamaMan deployments can now run as one logical cluster. Clustering is opt-in (CLUSTER_ENABLED plus a shared CLUSTER_SECRET) and entirely inert for single-node installs. Nodes discover each other through the database storage backend's shared node registry (register_node / list_nodes) rather than pairwise configuration, so any node added anywhere becomes visible to all; every node-to-node call carries the secret in an X-Cluster-Secret header (never as a client bearer token), and each node advertises how peers reach it via CLUSTER_ADVERTISE_URL. The dashboard gains per-node System and GPU monitoring cards and a Cluster settings tab showing each node's identity, advertise URL, heartbeat age (stamped on the database's clock so node-to-node skew can't flap a healthy node offline), and an HTTP-reachability badge. A control relay (/api/cluster/nodes/<id>/proxy/...) lets the UI drive launches, image pulls, and downloads on any node, with node selectors added to the launch, images, and downloads forms. New api/cluster.py, core/cluster.py, static/js/cluster.js, and an extensive tests/test_cluster.py.
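The header and heartbeat rules above can be sketched roughly as follows. This is a minimal illustration, not the code in core/cluster.py: `NodeRecord`, `peer_headers`, `heartbeat_age`, `is_stale`, and the `max_age_s` threshold are hypothetical names chosen for the example. Only the X-Cluster-Secret header name and the idea of measuring age against the database clock come from the notes.

```python
from dataclasses import dataclass

CLUSTER_SECRET_HEADER = "X-Cluster-Secret"  # header name taken from the release notes


@dataclass
class NodeRecord:
    """Hypothetical shape of one row in the shared node registry."""
    node_id: str
    advertise_url: str
    last_heartbeat: float  # seconds, stamped by the database clock


def peer_headers(cluster_secret: str) -> dict:
    """Headers for a node-to-node call: the secret travels in its own
    header and is never sent as an Authorization bearer token."""
    return {CLUSTER_SECRET_HEADER: cluster_secret}


def heartbeat_age(node: NodeRecord, db_now: float) -> float:
    """Age of the last heartbeat, measured against the database's 'now'
    rather than the local clock, so skew between nodes cannot inflate it."""
    return max(0.0, db_now - node.last_heartbeat)


def is_stale(node: NodeRecord, db_now: float, max_age_s: float) -> bool:
    """True when the node has been silent longer than max_age_s (assumed threshold)."""
    return heartbeat_age(node, db_now) > max_age_s
```

Because both the heartbeat stamp and `db_now` come from the same clock, a node whose local time is minutes off still reports an accurate age.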
Shared inference queue and cross-node least-load dispatch: with the "Share queue with same model" option enabled, instances of a model across nodes form a group, and an inference request is routed to the node in the group with the fewest in-flight requests; a queued request can migrate to a freer peer, with the hop chain bounded by MAX_HOPS and guarded against loops. An optional "Queue group name" pools same-family / different-quant instances under one alias (which also becomes the llama-server --alias the instance advertises), and a "Fallback only" flag marks instances that should serve only when every non-fallback member of the group is at capacity or unreachable. The cluster heartbeat runs on its own dedicated thread (CLUSTER_HEARTBEAT_INTERVAL_S) so forwarded inference on the shared worker can't starve it and flap nodes offline.
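One way to picture the selection rule is the sketch below. It is an illustration under stated assumptions, not the project's dispatcher: the `Member` fields, `pick_member`, and the value `MAX_HOPS = 3` are hypothetical (the notes name MAX_HOPS but give no default), and "at capacity" is modelled as `in_flight >= capacity`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAX_HOPS = 3  # assumed value for illustration only


@dataclass
class Member:
    """Hypothetical view of one instance in a queue group."""
    node_id: str
    in_flight: int
    capacity: int
    reachable: bool = True
    fallback_only: bool = False

    def available(self) -> bool:
        return self.reachable and self.in_flight < self.capacity


def pick_member(members: Sequence[Member], hops: Sequence[str] = ()) -> Optional[Member]:
    """Pick the least-loaded member, honouring fallback-only and hop limits."""
    if len(hops) >= MAX_HOPS:
        return None  # hop chain exhausted: stop migrating the request
    free = [m for m in members if m.available()]
    primaries = [m for m in free if not m.fallback_only]
    # Fallback-only members are considered only when no primary is free.
    pool = primaries if primaries else free
    # Loop guard: never send the request back to a node it already visited.
    pool = [m for m in pool if m.node_id not in hops]
    if not pool:
        return None
    return min(pool, key=lambda m: m.in_flight)
```

For example, with primaries at 2 and 1 in-flight requests and an idle fallback-only member, the member with 1 in-flight request is chosen; the fallback member is picked only once both primaries are full or unreachable.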
Request logging page (/logging, linked from the dashboard header): a dedicated page that rolls up recorded inference traffic into summary tiles (request count with errors, average and peak throughput, average TTFT and latency, prompt/completion/total tokens), a time-window selector (24h / 7d / 30d / all), a recent-conversations list, and a per-conversation drill-down showing each turn's prompt/response and metrics. New templates/logging.html and static/js/logging.js.
Per-node settings: settings that must differ per host - the node's Docker images and the model-cap eviction toggles - are now scoped under settings["nodes"][<node_id>] instead of being shared cluster-wide, while reads transparently fall back to the legacy top-level value so existing single-node installs upgrade with zero migration. New core/node_settings.py.
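The read-with-fallback behaviour can be sketched as below. This is a minimal illustration, not the contents of core/node_settings.py: the function names and the `docker_images` key are hypothetical, and only the settings["nodes"][<node_id>] layout and the fallback to the legacy top-level value come from the notes.

```python
from typing import Any


def get_node_setting(settings: dict, node_id: str, key: str, default: Any = None) -> Any:
    """Read a per-node setting, falling back to the legacy top-level value."""
    node_scope = settings.get("nodes", {}).get(node_id, {})
    if key in node_scope:
        return node_scope[key]
    # Legacy single-node installs stored the value at the top level.
    return settings.get(key, default)


def set_node_setting(settings: dict, node_id: str, key: str, value: Any) -> None:
    """Write only into the node's own namespace; the legacy value is left untouched."""
    settings.setdefault("nodes", {}).setdefault(node_id, {})[key] = value
```

A settings file written before the upgrade keeps working because the first read for any node finds nothing under "nodes" and returns the top-level value; the first write then creates that node's namespace.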
Cluster load-test scripts: scripts/cluster-loadtest.sh and scripts/cluster-loadtest-hi.sh for exercising cross-node dispatch under load.
Changed
LLAMAMAN_NODE_NAME is now required for every install, not just clusters: it is the node's stable identity, its per-node settings namespace, and the cluster registry key, so a single-node deployment can later join a cluster without orphaning its state. The app refuses to start (with guidance) when it is unset.
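A startup check of this kind might look like the following sketch. The function name, the exception class, and the wording of the message are assumptions made for illustration; only the LLAMAMAN_NODE_NAME variable and the refuse-to-start behaviour come from the notes.

```python
import os
from typing import Mapping, Optional


class MissingNodeNameError(RuntimeError):
    """Raised at startup when the node has no stable identity (hypothetical class)."""


def resolve_node_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the node name, or fail fast with guidance when it is unset."""
    env = os.environ if env is None else env
    name = (env.get("LLAMAMAN_NODE_NAME") or "").strip()
    if not name:
        raise MissingNodeNameError(
            "LLAMAMAN_NODE_NAME is not set. Set it to a stable, unique name "
            "for this host before starting the app."
        )
    return name
```

Treating a blank or whitespace-only value the same as an unset one is a design choice of this sketch; it avoids registering a node under an empty key.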
Default MTP draft count lowered from 3 to 2: the Speculative Decoding Draft N Max field now defaults to 2 when left blank.
Settings UI polish and quality-of-life: verbose inline descriptions were converted to hover info icons (the Fallback-only flag, the admin-UI eviction toggles, and the new cluster monitoring toggle), the Docker Images tab gained a "Manage Docker images" section heading, the Settings card moved to the top of the dashboard, and assorted spacing and layout were tightened. A new cluster toggle, "Hide long-offline nodes from resource monitors", drops a node that has been silent for over 10 minutes from the System and GPU cards only; the node stays listed under "Cluster nodes" and remains routable.
Docs: the README and Docker Hub overview were updated to cover clustering, the request-logging page, and the new CLUSTER_* / LLAMAMAN_NODE_NAME environment variables; the stale 1.0.0 screenshot was removed.
Fixed
Cross-node balancing could silently break when a peer was reachable in the database but not over HTTP (e.g. a WSL node advertising a host IP with no port-forward): such a node looks "online" by heartbeat yet can't actually be dispatched to. The Cluster tab now actively probes the dispatch path and flags those nodes as unreachable, so the broken balancing is visible instead of surfacing as stray 504/500s.
Auto-restart-on-crash moved off the monitoring tick: opt-in crash recovery now runs on a dedicated, loop-guarded daemon thread, so a restarting instance can't stall the poller - and, by extension, the cluster heartbeat. New tests/test_auto_restart.py.
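The general pattern, restart work handed to a dedicated daemon thread with a guard against crash loops, can be sketched as follows. This is not the project's implementation: the `RestartWorker` class, the queue hand-off, and the "at most `max_restarts` per `window_s`" rule are assumptions made to illustrate what "loop-guarded" could mean.

```python
import queue
import threading
import time
from collections import defaultdict, deque
from typing import Callable


class RestartWorker:
    """Runs restarts on its own daemon thread so the caller never blocks."""

    def __init__(self, restart_fn: Callable[[str], None],
                 max_restarts: int = 3, window_s: float = 60.0) -> None:
        self._restart_fn = restart_fn
        self._max = max_restarts
        self._window = window_s
        self._history = defaultdict(deque)  # instance_id -> recent restart times
        self._queue: "queue.Queue[str]" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def request_restart(self, instance_id: str) -> None:
        """Called from the monitoring tick; returns immediately."""
        self._queue.put(instance_id)

    def _allowed(self, instance_id: str, now: float) -> bool:
        recent = self._history[instance_id]
        while recent and now - recent[0] > self._window:
            recent.popleft()  # forget restarts outside the window
        if len(recent) >= self._max:
            return False  # crash loop: stop restarting this instance
        recent.append(now)
        return True

    def _run(self) -> None:
        while True:
            instance_id = self._queue.get()
            try:
                if self._allowed(instance_id, time.monotonic()):
                    self._restart_fn(instance_id)
            except Exception:
                pass  # a failed restart must not kill the worker thread
            finally:
                self._queue.task_done()

    def wait_idle(self) -> None:
        """Block until all queued requests are processed (useful in tests)."""
        self._queue.join()
```

The poller only enqueues an instance id, so a slow or hanging restart delays other restarts but not monitoring or the heartbeat.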
Logging page couldn't scroll: the page is a flex child of a height:100vh / overflow:hidden body and had no inner scroll container, so a tall conversations list was clipped with no way to scroll. .logging-page is now its own scroll region (flex:1; min-height:0; overflow-y:auto), mirroring the dashboard's main column.