Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client by matteoserva · Pull Request #13196 · ggml-org/llama.cpp

matteoserva · 2025-04-29T18:58:24Z

This PR implements support for setting additional jinja parameters.
An example of this is enable_thinking in the Qwen3 models template.

Main features:

Setting jinja variables from the command line using --chat_template_kwargs or the corresponding environment variable
Setting variables per request in the OAI-compatible API using the chat_template_kwargs parameter
Compatibility with the vLLM API

Notice

As of #13771 (server: add --reasoning-budget 0 to disable thinking, incl. qwen3 w/ enable_thinking:false), the preferred way to disable thinking from the command line is now --reasoning-budget 0. The command-line setting can still be overridden by passing chat_template_kwargs in the request to the OAI-compatible API.
There is an ongoing discussion in #13272 (Feature Request: add per-request "reasoning" options in llama-server) about setting the reasoning budget per request. This would make it possible to disable thinking completely for a single request by setting the budget to 0.
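
The precedence described in the notice (a value sent in the request wins over the command-line setting) can be illustrated with a small shell sketch. This is only a model of the behaviour described above, not the server's code; the function name effective_enable_thinking and the "unset" marker are hypothetical.

```shell
# Model of the precedence described above (NOT llama-server's implementation).
# $1: enable_thinking value implied by the command line (may be empty)
# $2: enable_thinking value sent in the request's chat_template_kwargs (may be empty)
effective_enable_thinking() {
  cli_value=${1-}
  request_value=${2-}
  if [ -n "$request_value" ]; then
    # A per-request value overrides the command-line setting.
    printf '%s\n' "$request_value"
  elif [ -n "$cli_value" ]; then
    printf '%s\n' "$cli_value"
  else
    # Nothing was set anywhere, so the template's own default applies.
    printf 'unset\n'
  fi
}
```

For example, effective_enable_thinking false true models a server started with thinking disabled that receives a request asking for thinking.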

Other info

The official template is still only partially compatible, so I modified it to use only supported features.
The modified template is here: ~~https://pastebin.com/16ZpCLHk~~ https://pastebin.com/GGuTbFRc
It should be loaded with llama-server --jinja --chat-template-file {template_file}

It fixes #13160 and #13189

Test it with:

enable_thinking=false. Expected: {"prompt":"\n<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"}

curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
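
To check the response against the expected prompt automatically, a sketch like the following could be used. It assumes the response is the raw JSON text returned by /apply-template, in which newlines appear as the two characters \n; the function name has_empty_think_block is hypothetical.

```shell
# Succeeds if the raw JSON response contains an empty <think></think> block,
# which is what the Qwen3 template emits when enable_thinking is false.
# The pattern matches the JSON-escaped form (literal backslash + n).
has_empty_think_block() {
  case "${1-}" in
    *'<think>\n\n</think>'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It could be called as has_empty_think_block "$(curl -s ...)" with the same curl arguments as above.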

enable_thinking=true

curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": true}
}'

enable_thinking undefined

curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5
}'
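
The three requests above differ only in the chat_template_kwargs field, so they can be generated from one helper. This is a minimal sketch: build_body, apply_template and the LLAMA_URL variable are hypothetical names, and the sampling fields are left out for brevity on the assumption that they do not affect the rendered prompt.

```shell
# Build the /apply-template request body.
# $1 is "true", "false", or empty (leave enable_thinking undefined).
build_body() {
  kwargs=""
  case "${1-}" in
    true|false)
      kwargs=$(printf ', "chat_template_kwargs": {"enable_thinking": %s}' "$1") ;;
    "") ;;
    *) echo "usage: build_body [true|false]" >&2; return 1 ;;
  esac
  printf '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]%s}\n' "$kwargs"
}

# Send the body to a running llama-server (LLAMA_URL is a hypothetical override).
apply_template() {
  body=$(build_body "${1-}") || return 1
  curl -s "${LLAMA_URL:-http://localhost:8080}/apply-template" \
    -H "Content-Type: application/json" -d "$body"
}

# Usage, with a server listening on localhost:8080:
#   for v in false true ""; do apply_template "$v"; echo; done
```

Keeping the body construction separate from the curl call makes it possible to inspect the JSON without a server running.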

CISC merged 16 commits into ggml-org:master from matteoserva:enable_thinking

18 participants
