
UPSTREAM PR #19374: WebUI hide models in router mode#1156

Open
loci-dev wants to merge 2 commits into main from loci/pr-19374-webui-hide-model

Conversation


@loci-dev loci-dev commented Feb 7, 2026

Note

Source pull request: ggml-org/llama.cpp#19374

When using llama-server with presets in router mode, this adds the ability to completely hide certain models from the WebUI.

Summary

Adding no-webui = true to a model's preset configuration will exclude it from the WebUI.

Why?

I tend to use the preset capability to load a group of models for offline vibe coding, but I don't want my embedding, reranking, and FIM models to be selectable via the WebUI. I also wanted the WebUI to automatically select a model when only one remains after filtering. I chose to reuse the no-webui argument for this, so any model in the preset file with no-webui = true is excluded from the WebUI model selection list. This also means that the excluded models cannot be loaded or unloaded via the WebUI.
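The filtering and auto-select behavior described above can be sketched as follows. This is a minimal illustration, not the actual WebUI code; the field names (`id`, `no_webui`) are assumptions about the model metadata shape.

```python
# Sketch of the described behavior: models flagged no-webui = true are
# excluded from the selection list, and if exactly one model remains it
# is selected automatically. Field names are illustrative assumptions.

def visible_models(models):
    """Exclude any model whose preset sets no-webui = true."""
    return [m for m in models if not m.get("no_webui", False)]

def pick_default(models):
    """Auto-select when exactly one model remains after filtering."""
    shown = visible_models(models)
    return shown[0]["id"] if len(shown) == 1 else None

models = [
    {"id": "gpt-oss-120b", "no_webui": False},
    {"id": "Qwen2.5-Coder-7B-Q8_0", "no_webui": True},
    {"id": "Qwen3-Embedding-0.6B", "no_webui": True},
    {"id": "Qwen3-Reranker-0.6B", "no_webui": True},
]
print([m["id"] for m in visible_models(models)])  # ['gpt-oss-120b']
print(pick_default(models))                       # gpt-oss-120b
```

With the example preset below, only gpt-oss-120b would appear in the selector and would be picked automatically.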

Example:

models.ini

```ini
[*]
ngl = 999
threads = -1

; chat, tools
[gpt-oss-120b]
hf = ggml-org/gpt-oss-120b-GGUF
load-on-startup = true
jinja = true
flash-attn = 1
ubatch-size = 2048
batch-size = 32768
parallel = 2
; ctx-size = 131072*params.n_parallel
ctx-size = 262144
temp = 1.0
;min-p = 0.0
min-p = 0.01
top-p = 1.0
top-k = 0.0
; --- Speculative
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 48
draft-max = 64

; FIM
[Qwen2.5-Coder-7B-Q8_0]
load-on-startup = true
hf = ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0
flash-attn = 1
ubatch-size = 1024
batch-size = 1024
ctx-size = 0
cache-reuse = 256
parallel = 2
; ctx-size = 32768*params.n_parallel
ctx-size = 65536
no-webui = true

; embedding
[Qwen3-Embedding-0.6B]
load-on-startup = true
hf = Qwen/Qwen3-Embedding-0.6B-GGUF:Q8_0
flash-attn = 1
embedding = true
pooling = last
ubatch-size = 8192
verbose-prompt = true
parallel = 2
; ctx-size = 8192*params.n_parallel
ctx-size = 16384
no-webui = true

; reranking
[Qwen3-Reranker-0.6B]
load-on-startup = true
hf = ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF:Q8_0
flash-attn = 1
rerank = true
parallel = 2
; ctx-size = 4096*params.n_parallel
ctx-size = 8192
no-webui = true
```

And now run it:

```shell
llama-server --offline --log-colors on --log-prefix --log-timestamps --no-models-autoload --models-preset models.ini
```

Note: I haven't tested the ini file as written above. I pre-downloaded the models and use model = modelfile.gguf instead of the listed hf = org/model. Aside from that, this ini file was what I tested against.
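The `[*]` section in the preset above appears to act as shared defaults applied to every model section. Under that assumption, the merge semantics can be sketched with Python's standard-library configparser, which supports a custom default-section name; this is only an illustration of the inheritance idea, not llama-server's actual preset parser.

```python
import configparser

# Treat "[*]" as the default section whose keys are inherited by every
# model section (an assumption about the preset format, for illustration).
ini = """
[*]
ngl = 999
threads = -1

[Qwen3-Embedding-0.6B]
no-webui = true
ctx-size = 16384
"""

cfg = configparser.ConfigParser(default_section="*")
cfg.read_string(ini)

section = cfg["Qwen3-Embedding-0.6B"]
print(section["ngl"])                  # 999, inherited from [*]
print(section.getboolean("no-webui"))  # True, set per model
```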


loci-review bot commented Feb 7, 2026

No meaningful performance changes were detected across 115474 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-bench.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a92fe2a to 6495042 Compare February 27, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 1d064d0 to 504cad7 Compare March 4, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 9f4f332 to 4298c74 Compare March 6, 2026 02:17