Turn a local llama.cpp checkout into something you can actually use:
build it, launch models, inspect GPU pressure, recover orphaned servers, and control everything from a browser or terminal.
Why | Screenshots | Start Here | UI | Technical Reference
- You do not have to manually juggle `llama.cpp` builds, `llama-server` launch commands, VRAM checks, and logs across separate scripts.
- The web UI gives you one place to build, launch, monitor, and recover running instances.
- If the backend restarts or crashes, it can re-adopt repo-launched `llama-server` processes instead of losing control of them.
- The browser UI is layered on top of the same local tools, so terminal users and UI users work against the same repo, binaries, and models.
- One repo checkout.
- One Python environment.
- One backend command to bring up the control plane.
- One token for the browser UI.
- The backend can serve the built frontend directly, so normal users do not need to run a separate frontend dev server.
The main control-plane page shows backend health, fleet status, host load, GPU pressure, and recent activity without making you dig through logs first.
Expandable GPU detail shows compute load, VRAM use, and which managed processes currently own memory on each device.
The Instances page gives you a proper launcher for llama-server and a way to
recover servers that survived a backend restart.
The Builds page wraps autodevops.py with real options, history, command
preview, and logs.
The Benchmarks page runs llama-bench, keeps the command and logs, and stores
parsed throughput results. The capture below is a live run of
unsloth/Qwen3.5-0.8B-GGUF:Q4_K_XL pinned to the RTX 4060.
The Memory page gives you a quick VRAM and RAM view before you launch a model.
The Library page shows local GGUFs and pulls new ones straight from Hugging Face.
The same control plane also works on narrow screens.
This is the simplest browser-first path for a normal user.
```bash
git clone https://github.com/CesarPetrescu/llama-cpp-autodeploy.git
cd llama-cpp-autodeploy

python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt

cd web/frontend
npm install
npm run build
cd ../..
```

That gives the backend a production frontend to serve at `/`.
```bash
python web_cli.py --init
python web_cli.py
```

What happens here:

- `python web_cli.py --init` creates `.web_config.json` and prints a bearer token.
- `python web_cli.py` starts the backend on port `8787` by default.
- If `web/frontend/dist/` exists, the backend also serves the frontend UI.

Open http://localhost:8787.
On first use:
- go to Settings
- paste the bearer token printed during `--init`
- save it once

After that, the app can build llama.cpp, launch models, show logs, inspect
GPU pressure, and recover managed instances from the browser.
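
If you want to sanity-check the backend from a terminal first, here is a minimal sketch. It assumes the token is sent as a standard `Authorization: Bearer` header; the endpoint paths are the ones listed in the API surface section below.

```bash
# Health check: the one endpoint that does not require the token
curl http://localhost:8787/api/health

# Any other endpoint needs the bearer token printed by --init
TOKEN="paste-your-token-here"
curl -H "Authorization: Bearer $TOKEN" http://localhost:8787/api/models/local
```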
If you are editing the UI instead of just using it:
```bash
cd web/frontend
npm run dev
```

That starts the Vite frontend at http://localhost:5173 and proxies API requests to the backend at http://127.0.0.1:8787.

| Page | What it is for |
|---|---|
| Dashboard | See backend health, host CPU/RAM/load, GPU pressure, builds, and fleet state |
| Instances | Create, recover, start, stop, restart, and delete llama-server processes |
| Instance logs | Watch live stdout with pause/resume |
| Memory | Estimate placement and VRAM needs before launch |
| Library | Scan local GGUFs and download new ones from Hugging Face |
| Builds | Run autodevops.py, inspect supported options, and stream logs |
| Benchmarks | Run llama-bench, pin tests to specific GPUs, and keep structured throughput history |
| Settings | Set backend URL and bearer token |

| Layer | Role |
|---|---|
| `autodevops.py` | Build local llama.cpp binaries |
| `loadmodel.py` | Launch llama-server and reranker processes |
| `memory_utils.py` | Probe VRAM, RAM, and placement estimates |
| `web/backend/` | Auth, state, logs, recovery, and API surface |
| `web/frontend/` | Browser UI for overview, builds, instances, memory, and library |

- Linux with Python 3.10+.
- Build tools for `llama.cpp`: `git`, `cmake`, `make`, `gcc`, `g++`, `pkg-config` (see the install sketch below).
- NVIDIA drivers and CUDA toolkit if you want CUDA builds or GPU runtime.
- Optional BLAS libraries:
  - Intel MKL for `--blas mkl`
  - OpenBLAS for `--blas openblas`

Python dependencies are in requirements.txt.
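
On Debian/Ubuntu-style systems, the build prerequisites above usually map to something like the following. Package names are an assumption and vary by distribution; `python3-venv` covers the virtual-environment step.

```bash
sudo apt update
sudo apt install -y git cmake make gcc g++ pkg-config python3-venv
# NVIDIA driver + CUDA toolkit only if you want CUDA builds or GPU runtime
```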
Interactive build flow:

```bash
python autodevops_cli.py
```

Non-interactive build flow:

```bash
python autodevops.py --help
python autodevops.py --ref latest --now
```

Supported build flags:

| Flag | Meaning |
|---|---|
| `--ref <tag\|branch\|…>` | Choose which llama.cpp ref to build (`latest` in the example above) |
| `--now` | Build immediately instead of waiting for the scheduled path |
| `--fast-math` | Pass fast-math CUDA flags to NVCC |
| `--force-mmq {auto,on,off}` | Control MMQ CUDA kernels |
| `--blas {auto,openblas,mkl,off}` | Choose the CPU BLAS backend |
| `--distributed` | Build GGML RPC support |
| `--cpu-only` | Skip NVIDIA driver prechecks |
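
As an illustrative combination (not a recommendation for every setup), a one-shot build pinned to the latest ref with MKL BLAS and fast-math CUDA flags could look like this; whether `--fast-math` is worthwhile depends on your hardware and accuracy needs.

```bash
python autodevops.py --ref latest --now --blas mkl --fast-math
```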
Interactive launcher:

```bash
python loadmodel_cli.py
```

Unified launcher:

```bash
python loadmodel.py --help
```

`loadmodel.py` supports three mutually exclusive modes:

| Mode | Result |
|---|---|
| `--llm` | Start `./bin/llama-server` for completion/chat |
| `--embed` | Start `./bin/llama-server` for embeddings |
| `--rerank` | Start the Transformers reranker HTTP service |

Examples:

```bash
# LLM (local GGUF)
python loadmodel.py --llm ./models/model.gguf --port 45540

# Embeddings (download GGUF from HF repo, auto-select quant/file)
python loadmodel.py --embed Qwen/Qwen3-Embedding-8B-GGUF:Q8_0 --port 45541

# Reranker HTTP server
python loadmodel.py --rerank Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 45542
```

For MoE-capable llama-server builds, `loadmodel.py` also accepts:

- `--cpu-moe`
- `--n-cpu-moe <N>`

If the local llama-server binary does not expose these flags, `loadmodel.py` exits with a rebuild hint.
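
A minimal MoE launch sketch, assuming a MoE-capable build; the model path is hypothetical:

```bash
# --n-cpu-moe keeps MoE expert weights for N layers on the CPU (saves VRAM at some speed cost)
python loadmodel.py --llm ./models/my-moe-model.gguf --port 45540 --n-cpu-moe 8
```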
Interactive distributed launcher:

```bash
python loadmodel_dist_cli.py
```

This flow can:

- scan private subnets for RPC workers
- manage the worker host list
- optionally start a local `rpc-server`
- launch `llama-cli` with `--rpc` workers (see the sketch after this list)
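
Under the hood, the final launch step corresponds roughly to a command like the one below. The worker addresses and model path are illustrative; the interactive flow assembles the real command for you.

```bash
# llama-cli accepts a comma-separated list of RPC worker endpoints
./bin/llama-cli -m ./models/model.gguf --rpc 192.168.1.10:5515,192.168.1.11:5515
```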
Standalone rpc-server helper:

```bash
python rpc_server_cli.py --help
python rpc_server_cli.py --host 0.0.0.0 --port 5515 --devices 0
```

`rpc_server_cli.py` requires `./bin/rpc-server` to exist.

Backend startup:

```bash
python web_cli.py --init
python web_cli.py
```

The backend:

- binds to `0.0.0.0` by default
- requires a bearer token on every request except `GET /api/health`
- persists managed instances, builds, and benchmark runs in `.web_state.json`
- tees logs to `web/logs/<id>.log`
- can re-adopt orphaned repo-launched `llama-server` processes on startup
- can force that same recovery flow through `POST /api/instances/recover` (example below)
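
For example, forcing the recovery pass from a terminal (port and token as configured on your machine):

```bash
curl -X POST -H "Authorization: Bearer $TOKEN" http://localhost:8787/api/instances/recover
```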
API surface

- Health: `GET /api/health`
- Memory: `GET /api/memory/gpus`, `POST /api/memory/plan`, `POST /api/memory/auto-split`
- Models: `GET /api/models/local`, `GET /api/models/binary-caps`, `POST /api/models/download`
- Instances: `GET/POST /api/instances`, `GET /api/instances/{id}`, `POST /api/instances/{id}/start|stop|restart`, `DELETE /api/instances/{id}`, `POST /api/instances/recover`, `WS /api/instances/{id}/logs?token=...`
- Builds: `GET/POST /api/builds`, `GET /api/builds/{id}`, `POST /api/builds/{id}/stop`, `WS /api/builds/{id}/logs?token=...`
- Benchmarks: `GET/POST /api/benchmarks`, `GET /api/benchmarks/{id}`, `POST /api/benchmarks/{id}/stop`, `WS /api/benchmarks/{id}/logs?token=...`

Full schema: `GET /docs`
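
A few curl sketches against the read-only endpoints; exact response shapes are best confirmed via `GET /docs`. The `TOKEN` placeholder stands for your own bearer token.

```bash
TOKEN="paste-your-token-here"
BASE=http://localhost:8787

curl -H "Authorization: Bearer $TOKEN" "$BASE/api/memory/gpus"   # GPU inventory
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/models/local"  # local GGUFs
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/instances"     # managed instances
```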
- The bearer token is the only built-in auth layer.
- Keep `.web_config.json` readable only by you.
- Prefer binding to `127.0.0.1` when you do not need remote access.
- WebSocket endpoints use `?token=` because browsers cannot attach `Authorization` headers during the upgrade request (see the sketch after this list).
- If you expose the backend beyond a trusted LAN, put it behind HTTPS.
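
For example, tailing an instance's log stream outside the browser, assuming the third-party `websocat` tool is installed and `<id>` is a managed instance id:

```bash
websocat "ws://127.0.0.1:8787/api/instances/<id>/logs?token=$TOKEN"
```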
Refresh screenshot assets:

```bash
cd web/frontend
npx playwright install chromium
WEB_BEARER_TOKEN="$(python - <<'PY'
import json
print(json.load(open('../../.web_config.json', 'r', encoding='utf-8'))['token'])
PY
)" npm run screenshots:readme
```

`./start` uses `./venv/bin/python` and offers a small menu:

```bash
./start
./start autodevops
./start loadmodel
./start web [--init]
./start --help
```

Run unit tests:

```bash
python -m unittest discover -s tests
```

Current tests cover:

- CUDA home resolution behavior in `autodevops.py`
- option and config assembly helpers in `autodevops_cli.py`
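
unittest's standard `-k` filter narrows the run to matching test names; the `cuda` pattern below is an assumption about how the CUDA-home tests are named.

```bash
python -m unittest discover -s tests -k cuda -v
```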
`run/` currently includes:

- `run_qwen30b_llm.sh`
- `run_qwen_embed8b.sh`
- `run_qwen_reranker8b.sh`

These are example launchers for fixed ports and model targets.
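
Each script is essentially a thin wrapper around `loadmodel.py`; the sketch below is hypothetical, and the real model paths and ports live in the scripts themselves.

```bash
#!/usr/bin/env bash
# Illustrative only: pin one model to one fixed port via the repo launcher
source venv/bin/activate
python loadmodel.py --llm ./models/your-model.gguf --port 45540
```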









