
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034

Open · ochafik wants to merge 36 commits into master from ochafik/tool-bench-prod
Conversation

ochafik (Collaborator) commented Feb 22, 2025

TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature

Instructions to build this branch:

```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-bench-prod
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server
llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF
```
  • Added support for regex grammar triggers, and honoured the at-start-only flag (it was already declared but never implemented; this should avoid spurious triggering when the triggers are defined as wide catch-alls).
    • In llama.h, deprecated llama_sampler_init_grammar_lazy (which took tokens or words) in favour of llama_sampler_init_grammar_lazy_patterns (which takes tokens or full-string regex patterns, with a capture group marking the point from which the grammar is triggered); see the sketch after this list.
  • Dramatically improved the tool-call success rate of Qwen 2.5 Coder (Hermes 2 format) with more triggers that match what the models tend to output (esp. at higher temperatures) and looser, regex-based triggers.
  • Added scripts/tool_bench.py to evaluate the tool-call compliance probability of llama-server & ollama on different models at different temperatures.
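
For illustration, here's a minimal sketch of driving the new API with a regex trigger pattern. The call is assumed to mirror the deprecated llama_sampler_init_grammar_lazy with trigger words swapped for patterns; the GBNF string, root rule, and pattern below are made up for the example, so check llama.h for the authoritative signature.

```cpp
#include "llama.h"

// Sketch only: build a sampler whose grammar stays dormant until the model
// starts emitting a Hermes-2-style <tool_call> block. `vocab` and
// `grammar_str` are supplied by the caller.
static llama_sampler * make_lazy_tool_call_sampler(
        const llama_vocab * vocab, const char * grammar_str) {
    // The pattern must match the entire generation so far; the capture group
    // marks the point from which the grammar is enforced. Anchoring with "^"
    // avoids spurious triggering on a <tool_call> mentioned mid-prose.
    static const char * trigger_patterns[] = {
        "^\\s*(<tool_call>[\\s\\S]*)",
    };

    return llama_sampler_init_grammar_lazy_patterns(
        vocab, grammar_str, "root",
        trigger_patterns, 1,   // full-string regex trigger patterns
        nullptr, 0);           // no token-based triggers in this sketch
}
```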

The following heatmap shows the compliance ratio on two very basic tool-call tests (the hello-world & weather tests from examples/server/tests/unit/test_tool_call.py, now shared with the bench tool), with 3 pairs of columns for this PR's llama-server, the baseline llama-server (master), and ollama.

[heatmap image]

[heatmap image: qwenc1.5b]

```sh
export ARGS=( --n 30 --llama-baseline="$(which llama-server)" --temp -1 --temp 0 --temp 0.5 --temp 0.75 --temp 1 --temp 1.5 --temp 2 --temp 5 )

./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 7B Q4_K_M"   --output ../qwenc7b.jsonl   --hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF:Q4_K_M   --ollama qwen2.5-coder:7b-instruct-q4_K_M
./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 1.5B Q4_K_M" --output ../qwenc1.5b.jsonl --hf unsloth/Qwen2.5-Coder-1.5B-Instruct-128K-GGUF:Q4_K_M --ollama qwen2.5-coder:1.5b-instruct-q4_K_M
```

See the gist with results for many more models.

Notes about results:

  • The failures of llama-server at temp = 2 are model humour / a stylistic choice ("Sure! You can use the following Python code..." instead of a tool call).
  • ollama seems to only recognize the tool-call format from the template, but models like Qwen 2.5 Coder 7B are quite... creative in their tool-call outputs, esp. at higher temperatures.
  • ollama's default temperature seems to be 0.6 (hence the row at temp = None roughly matching the results of the lower rows).
  • The tests may need further tweaking to accept arguably "correct" answers. The framing of the hello-world test is questionable: sometimes the models just explain how they would write the code.
  • The benchmark tool also supports running test_calc_results, which evaluates how well a model follows up on tool results. This has more varied failure modes, so it's not evaluated by default.

TODO:

  • Run & share more bench results (esp. other Qwen Coder variants!)
  • Stabilize tests / CI
  • Analyze bench times

github-actions bot added the script (Script related), testing (Everything test related), examples, python (python script changes), and server labels on Feb 22, 2025
GuuD commented Feb 22, 2025

Was Qwen 2.5 Coder even trained for tool use? 🤯

ochafik (Collaborator, Author) commented Feb 23, 2025

> Was Qwen 2.5 Coder even trained for tool use? 🤯

@GuuD I guess all models must be, to some extent, these days. Their technical report only mentions in passing that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following", and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their Jinja template.

ochafik marked this pull request as ready for review on February 25, 2025, and requested a review from ngxson as a code owner.
Mushoz commented Mar 2, 2025

I see you also opened a PR for Cline to actually utilize this. Is there any chance you could do the same for Roo Code? I have been using both Cline and Roo Code with Qwen2.5-Coder-32b with moderate success, so any improvements to that workflow are more than welcome!

ochafik (Collaborator, Author) commented Mar 2, 2025

> I see you also opened a PR for Cline to actually utilize this. Is there any chance you could do the same for Roo Code? I have been using both Cline and Roo Code with Qwen2.5-Coder-32b with moderate success, so any improvements to that workflow are more than welcome!

Hey @Mushoz, I'll check Roo Code out, but probably not for a few weeks. Also, my PR on Cline was rather promptly rejected 😓 (cline/cline#1946). Maybe I should have opened with a screencast of the very decent results I've been getting 🫣.

I've started looking at llama-vscode instead (cc @ggerganov @igardev). I'm trying to get streamed tool calls working on both ends as a follow-up to this PR (this requires partial JSON decoding on llama-server's side, nearly done, and streamed/partial JSON decoding on the IDE side, fully done, to get the same kind of streamed diffs Cline has).

Hey @ngxson, would you have time to take a look at this one?

Mushoz commented Mar 2, 2025

@ochafik Roo Code is a fork of Cline that is rapidly growing in popularity. Since it's a fork, I'm hopeful that porting your PR to Roo Code is relatively "easy".

ggerganov (Member) commented

@ochafik I saw the PR in Cline, but I don't have much to add as I am neither familiar with Cline nor with the tools/MCP support yet. It seems to me the functionality is powerful, but I need to dedicate some time to understand the details and how it works in order to provide any meaningful feedback. In any case, I think adding this kind of functionality to llama-vscode would be welcome as I believe @igardev (the maintainer of the extension) is interested in such high-level/agentic workflows.

ggerganov (Member) left a comment:

Overall, I think the changes are good, though I haven't run any tests yet and have mostly looked at the syntax. Let's wait for @ngxson's review.

```diff
  grammar.trigger_buffer.clear();
  llama_grammar_accept_str(grammar, constrained_str);
- LLAMA_LOG_DEBUG("Grammar triggered on word `%s`", word.c_str());
+ LLAMA_LOG_DEBUG("Grammar triggered on regex: %s", constrained_str.c_str());
```
ggerganov (Member) suggested a change:

```diff
-LLAMA_LOG_DEBUG("Grammar triggered on regex: %s", constrained_str.c_str());
+LLAMA_LOG_DEBUG("Grammar triggered on regex: '%s'\n", constrained_str.c_str());
```

Comment on lines +126 to +127:

```cpp
std::vector<std::pair<std::string, std::regex>>
    trigger_patterns; // Regular expressions that trigger a lazy grammar. Must be a full match of the entire generated
```

ggerganov (Member): Make a struct here. I don't like the std::pair: it always ends up being not enough, and it's too heavy to read.
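
For illustration only, a sketch of the kind of struct this could become (the type and field names are hypothetical, not the PR's actual code):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical named-field replacement for the std::pair.
struct llama_grammar_trigger_pattern {
    std::string pattern; // original pattern string, kept for logs/errors
    std::regex  regex;   // compiled pattern, matched against the full output
};

std::vector<llama_grammar_trigger_pattern> trigger_patterns;
```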

```diff
-    std::string word;
-    bool at_start;
+    common_grammar_trigger_type type;
+    std::variant<llama_token, std::string> value;
```

ggerganov (Member): A bit hesitant about the usage of std::variant here. Overall it's OK, but it's the first time we use it in the codebase, and I'm not sure it's worth introducing as a pattern. Having 2 separate members, value_token and value_string, is the alternative.
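
To make the trade-off concrete, here is a rough sketch of the two options (common_grammar_trigger_type and llama_token come from the surrounding code; the struct names and defaults are illustrative):

```cpp
#include <string>
#include <variant>

// Option A (as in this PR): one field whose active alternative follows `type`.
struct trigger_with_variant {
    common_grammar_trigger_type type;
    std::variant<llama_token, std::string> value; // token, or word/pattern
};

// Option B (the suggested alternative): two plain members; only the one
// matching `type` is meaningful, the other stays at its default.
struct trigger_with_two_members {
    common_grammar_trigger_type type;
    llama_token value_token = -1; // set for token triggers
    std::string value_string;     // set for word/pattern triggers
};
```

The variant version makes it impossible to carry both values at once, at the cost of std::get / std::holds_alternative at every use site; the two-member version reads plainly but relies on convention.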

ngxson (Collaborator) commented Mar 3, 2025

My backlog is quite full today; I'll do a review tomorrow.
