tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034
base: master
Conversation
Commits: Update llama-grammar.h · update · Update llama-grammar.h · Update common.h · Update common.h · Update sampling.cpp · Update chat.cpp · update test_tool_call.py · Update server.cpp · Update utils.hpp · Update chat.cpp · Update test_tool_call.py · Update fetch_server_test_models.py · …3 8b tool outputs)
Was Qwen 2.5 Coder even trained for tool use? 🤯
@GuuD I guess all models must be to some extent, these days. Their technical report only mentions in passing the fact that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following", and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their jinja template.
I see you also opened a PR for Cline to actually utilize this. Is there any chance you could do the same for Roo Code? I have been using both Cline and Roo Code with Qwen2.5-Coder-32b with moderate success, so any improvements to that workflow are more than welcome!
Hey @Mushoz, I'll check Roo Code out, but probably not before a few weeks. Also, my PR on Cline was somewhat promptly rejected 😓 (cline/cline#1946). Maybe I should have opened with a screencast of the very decent results I've been getting 🫣. I've started looking at llama-vscode instead (cc @ggerganov @igardev). As a follow-up to this PR, I'm trying to get streamed tool calls to work on both ends: this requires some partial JSON decoding on llama-server's side (nearly done) and streamed / partial JSON decoding on the IDE side (fully done), to get the same kind of streamed diffs as Cline has. Hey @ngxson, would you have time to take a look at this one?
@ochafik Roo Code is a fork of Cline that is rapidly growing in popularity. Since it's a fork, I am hopeful it's relatively "easy" to port your PR to Roo Code instead.
@ochafik I saw the PR in Cline, but I don't have much to add as I am neither familiar with Cline nor with the tools/MCP support yet. It seems to me the functionality is powerful, but I need to dedicate some time to understand the details and how it works in order to provide any meaningful feedback. In any case, I think adding this kind of functionality to …
Overall, I think the changes are good, though I haven't run any tests yet and mostly looked at the syntax. Let's wait for @ngxson's review.
```cpp
grammar.trigger_buffer.clear();
llama_grammar_accept_str(grammar, constrained_str);
LLAMA_LOG_DEBUG("Grammar triggered on word `%s`", word.c_str());
LLAMA_LOG_DEBUG("Grammar triggered on regex: %s", constrained_str.c_str());
```
Suggested change:
```diff
- LLAMA_LOG_DEBUG("Grammar triggered on regex: %s", constrained_str.c_str());
+ LLAMA_LOG_DEBUG("Grammar triggered on regex: '%s'\n", constrained_str.c_str());
```
```cpp
std::vector<std::pair<std::string, std::regex>>
    trigger_patterns; // Regular expressions that trigger a lazy grammar. Must be a full match of the entire generated…
```
Make a struct here. I don't like the `std::pair`: it always ends up not being enough, and it's too heavy to read.
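For illustration, a minimal sketch of what that struct could look like (the name and comments are assumptions; the fields mirror the `std::pair<std::string, std::regex>` it would replace):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical named replacement for std::pair<std::string, std::regex>.
struct common_grammar_trigger_pattern {
    std::string pattern; // original regex source, handy for logging
    std::regex  regex;   // compiled form, used for matching
};

std::vector<common_grammar_trigger_pattern> trigger_patterns;
```

Named fields avoid `.first`/`.second` at call sites and leave room to grow the struct later.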
```cpp
std::string word;
bool at_start;
common_grammar_trigger_type type;
std::variant<llama_token, std::string> value;
```
I'm a bit hesitant about the usage of `std::variant` here. Overall it's OK, but it's the first time we use it in the codebase, and I'm not sure it's worth introducing as a pattern. Having two separate members, `value_token` and `value_string`, is the alternative.
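To make the trade-off concrete, here is a rough side-by-side of the two layouts being discussed (struct names are placeholders; `llama_token` is typedef'd as in `llama.h`, and the enum body is elided):

```cpp
#include <cstdint>
#include <string>
#include <variant>

using llama_token = int32_t;                    // as in llama.h
enum common_grammar_trigger_type { /* ... */ }; // as in common.h

// Option A (this PR): one tagged value; `type` says which alternative
// of the variant is active.
struct trigger_with_variant {
    common_grammar_trigger_type type;
    std::variant<llama_token, std::string> value;
};

// Option B (alternative raised here): two plain members, only one of
// which is meaningful for a given `type`.
struct trigger_with_two_members {
    common_grammar_trigger_type type;
    llama_token value_token;   // used when the trigger is a token
    std::string value_string;  // used when the trigger is a word/pattern
};
```

The variant enforces at the type level that only one value is set, at the cost of introducing `<variant>` as a codebase pattern; the two-member layout reads more plainly but relies on `type` being checked consistently.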
My backlog is quite full today; I'll do a review tomorrow.
TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature
instructions to build this branch
- Updated `llama.h`, deprecating `llama_sampler_init_grammar_lazy` (which used to take tokens or words) in favour of `llama_sampler_init_grammar_lazy_patterns` (which takes tokens or full-string regex patterns w/ a group that marks from where the grammar is triggered); a usage sketch follows below
- Added `scripts/tool_bench.py` to evaluate tool call compliance probability of `llama-server` & `ollama` on different models, at different temperatures

The following heatmap shows compliance ratio on two super basic tool call tests (hello world & weather tests from `examples/server/tests/unit/test_tool_call.py`, now shared w/ the bench tool): 3 pairs of columns for llama-server of this PR, baseline llama-server (master), and ollama.

See gist with results for many more models.
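As a rough usage sketch of the new API (the signature is paraphrased from `llama.h` and should be checked against the header; the grammar string and `<tool_call>` trigger pattern are illustrative, not taken from this PR):

```cpp
#include "llama.h"

// Sketch: install a lazy grammar that stays dormant until a trigger
// pattern matches. The regex must match the *entire* generation so far;
// capture group 1 marks the point from which the grammar constrains output.
struct llama_sampler * make_lazy_tool_call_sampler(
        const struct llama_vocab * vocab,
        const char * tool_call_grammar) { // GBNF for the tool-call payload
    static const char * trigger_patterns[] = {
        "[\\s\\S]*?(<tool_call>[\\s\\S]*)", // fire once "<tool_call>" appears
    };
    return llama_sampler_init_grammar_lazy_patterns(
        vocab,
        tool_call_grammar,
        "root",              // grammar root rule
        trigger_patterns, 1, // regex triggers; group 1 = grammar start
        nullptr, 0);         // no token triggers in this sketch
}
```

Until a pattern matches, sampling is unconstrained; once it does, the grammar applies from the first match group onwards, which is how the lazy grammar can tolerate the varied prefixes models wrap their tool calls in.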
Notes about results:

- … ("Sure! You can use the following Python code..." instead of a tool call)
- … (@ None kinda fits results of lower rows)
- There is also `test_calc_results`, which evaluates how well a model follows up on tool results. This seems to have more varied failure modes, so it's not evaluated by default.

TODO: