Autoparser - complete refactoring of parser architecture #18675
base: master
Conversation
Does this mean we don't need to write a parser anymore, and it will be automatically generated from the chat template?
Yup, that's the gist of it.
This feels almost magical. How does it work? Does it detect common patterns in the rendered template output? What happens if the chat template requires additional arguments?
Yeah, it does differential analysis - it prepares different inputs to the template and then tests the outputs. For example, by using the same function signature with a different name you can identify where the function name goes, and by using the same function with one and two parameters you can identify how parameters are passed, etc. The nice thing is, I managed to squish it to just 2k lines of code (1k for analysis and 1k for helpers), so it's not even that bloated. As for custom inputs - I assume standard inputs here, and that's what most template makers try to adhere to anyway. If not, you end up with a custom handler like for Ministral - but as a followup I want to separate handlers from parsers (since passing extra params is much easier than handling an entire template from scratch) or even add autodetection for common custom keywords (we're going to have to support "reasoning" in addition to "reasoning_content" at some point because vLLM is moving to that).
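To make the differential idea concrete, here is a minimal sketch (the render() and tool_call() helpers in the usage comment are hypothetical, and this is not the actual analysis code): render the same tool call twice with only the function name changed, then take the longest common prefix and suffix of the two outputs - whatever surrounds the substitution point is the marker text around the function name.

// Hypothetical sketch, not the actual implementation: locate the function-name slot
// by rendering the same tool call twice with different names and diffing the outputs.
#include <algorithm>
#include <string>

struct marker_split { std::string before; std::string after; };

static marker_split find_substitution_point(const std::string & a, const std::string & b) {
    // longest common prefix of the two rendered outputs
    auto   pre        = std::mismatch(a.begin(), a.end(), b.begin(), b.end());
    size_t prefix_len = (size_t) std::distance(a.begin(), pre.first);
    // longest common suffix that does not overlap the prefix
    size_t suffix_len = 0;
    while (suffix_len < a.size() - prefix_len && suffix_len < b.size() - prefix_len &&
           a[a.size() - 1 - suffix_len] == b[b.size() - 1 - suffix_len]) {
        suffix_len++;
    }
    return { a.substr(0, prefix_len), a.substr(a.size() - suffix_len) };
}

// Usage idea (render() / tool_call() are made-up helpers):
//   auto s = find_substitution_point(render(tmpl, tool_call("get_weather", args)),
//                                    render(tmpl, tool_call("lookup_weather", args)));
//   // s.before now ends with whatever marker precedes the function name in this template.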
This approach does not seem to work well for models like […]. Additionally, I am planning to add a new custom parser for […].
I've heard of those mysterious tool calls inside thinking blocks for K2-Thinking, but I have yet to confirm whether they are an actual thing or just an artifact of low quantization. To be honest, outside of the native provider, I haven't seen K2-Thinking implemented anywhere in a working fashion. The Chutes version that I tested quite a few times bugs out on tool calling extremely often. I'm really skeptical of modifying anything based on hearsay and things "floating around". I remember the discussion here about interleaved thinking, and I myself was convinced that meant models could have multiple […]. As for the Mirocode, I guess you're talking about adapting the Python code-based stuff they showed (the one that uses separate tags for MCP servers and code calling)? You can see how custom parsers are defined in […].
All right, I've reached the "all tests passed" phase for […].
// Some templates (like Command-R7B) include reasoning markers in tool outputs but not in prompts
if (result.content.reasoning_start.empty() && !result.tools.tool_section_start.empty()) {
    // Known reasoning end marker patterns that might be embedded in tool_section_start
    std::vector<std::pair<std::string, std::string>> reasoning_patterns = {
Just an idea, not sure if it's worth doing: can we move all the heuristic definitions like this to the top of the file, so they can be found more easily in the future?
Also, it may be better to add const and maybe use const char *:
-    std::vector<std::pair<std::string, std::string>> reasoning_patterns = {
+    const std::vector<std::pair<const char *, const char *>> reasoning_patterns = {
size_t content_pos = output.find("ACTUAL_CONTENT_HERE");

if (content_pos != std::string::npos) {
    // For recipient-based format, find the last occurrence of tool_call_start_marker
    // before the content. The marker is from that position to the content (including the newline).
    size_t marker_pos = output.rfind(result.tools.tool_section_start, content_pos);

    if (marker_pos != std::string::npos && marker_pos < content_pos) {
        // Find the newline after the marker
        size_t newline_pos = output.find('\n', marker_pos);

        if (newline_pos != std::string::npos && newline_pos < content_pos) {
This pattern seems like a bottom-up parser to me: we find a needle in the haystack, then, if it exists, find something before/after it, and so on.
Just wondering if we can use the PEG parser here, WDYT @aldehir?
Otherwise, I would prefer avoiding nested if branches if possible, probably by re-implementing a very lightweight bottom-up parser that wraps around std::string::find.
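For illustration, such a wrapper could be as simple as this (hypothetical, just to show the shape): a small cursor object that chains lookups and goes invalid on the first miss, so the call site stays flat instead of nesting if branches.

// Hypothetical sketch of a thin wrapper around std::string::find that keeps
// call sites flat: once any lookup misses, the cursor stays invalid.
#include <string>

struct find_cursor {
    const std::string & s;
    size_t pos = 0;
    bool   ok  = true;

    find_cursor & find(const std::string & needle) {
        if (ok) {
            pos = s.find(needle, pos);
            ok  = (pos != std::string::npos);
        }
        return *this;
    }
    find_cursor & rfind_before(const std::string & needle, size_t limit) {
        if (ok) {
            pos = s.rfind(needle, limit);
            ok  = (pos != std::string::npos && pos < limit);
        }
        return *this;
    }
};

// Usage idea, mirroring the snippet above:
//   find_cursor c{output};
//   size_t content_pos = c.find("ACTUAL_CONTENT_HERE").pos;
//   size_t marker_pos  = c.rfind_before(tool_section_start, content_pos).pos;
//   if (c.ok) { /* ... */ }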
To search for the tool markers and content? Yes, I suppose you could do something like this:
p.until(tool_section_start) + p.tag("tool-start", p.literal(tool_section_start))
    + p.until("ACTUAL_CONTENT_HERE") + p.tag("content-start", p.literal("ACTUAL_CONTENT_HERE"))
    + p.until("\n") + p.tag("newline-pos", p.literal("\n"))
    + ... ;

Then traverse the AST to get the positions from each tagged node.
Although, I'm not sure it would be the best application in this particular case.
trim_whitespace(cs.reasoning_end);

// If we found reasoning_end but not reasoning_start, try to derive it from reasoning_end
// For example: </think> -> <think>, </|END_THINKING|> -> <|START_THINKING|>
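(For context, a minimal sketch of what the derivation described in that comment could look like - illustrative only, not the code under review:)

// Illustrative sketch: derive the reasoning start marker from the end marker,
// handling both plain closing tags and special-token style markers.
#include <string>

static std::string derive_reasoning_start(std::string end_marker) {
    // Plain closing tag: drop the '/', e.g. "</think>" -> "<think>"
    size_t slash = end_marker.find("</");
    if (slash != std::string::npos) {
        end_marker.erase(slash + 1, 1);
    }
    // Special-token style: swap END_ -> START_, e.g. "<|END_THINKING|>" -> "<|START_THINKING|>"
    size_t end_kw = end_marker.find("END_");
    if (end_kw != std::string::npos) {
        end_marker.replace(end_kw, 4, "START_");
    }
    return end_marker;
}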
seems like it's something easy to do with peg, pseudo-code:
auto core_name = p.tag("core_name", p.chars("[a-zA-Z_]"));
auto tag_reasoning = p.rule("tag_reasoning", "<" + p.optional("/") + p.optional("|") + core_name + p.optional("|") + ">");
// then, get the core_name value
Just a note that we should avoid doing too many unrelated (e.g. whitespace) changes here; otherwise it's quite difficult for me to see which parts of the logic changed - they are currently mixed with whitespace changes.
ARGH :/ ye, had a version of VSCode on my laptop that had the clangd "autoformat on save" feature turned on :/
 |-- Compose into final parser
 |
 v
common_chat_params (parser, grammar, triggers, preserved_tokens)
Continuing the discussion from your comment: #18961 (review)
I think a cleaner solution is to allow common_chat_parse to accept both common_chat_params and common_chat_parser_params as input params.
- common_chat_params contains the parser config; it is tied to the input template. The instance of common_chat_params is created once when the server/cli/etc. launches and remains there until the application exits.
- common_chat_parser_params contains the per-request config, created upon receiving a new request. These params can be controlled via the API request.
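Roughly, the split could look like this (a sketch - the field names are made up, not a final API):

// Illustrative only: two config structs with different lifetimes feeding one parse call.
#include <string>

struct common_chat_msg;                          // parsed result type (declared elsewhere)

struct common_chat_params {                      // template-scoped: built once at startup
    std::string grammar;                         // plus parser definition, triggers, preserved tokens, ...
};

struct common_chat_parser_params {               // request-scoped: built per API request
    bool parse_tool_calls  = true;
    bool extract_reasoning = true;
};

common_chat_msg common_chat_parse(
        const std::string               & input,
        bool                              is_partial,
        const common_chat_params        & tmpl_cfg,   // tied to the input template
        const common_chat_parser_params & req_cfg);   // controlled via the API request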
On second thought, that may not work well with the test infrastructure. If we do what I said above, test parsing will now depend on template formatting.
I'll be honest, there is a lot of complexity in the analysis. I feel it imposes a high cognitive load to understand, or maybe I'm just not that smart. The premise that any chat template will be supported OOTB is infeasible. Case in point: ddf0448 adds logic to support GLM-4.7; it wasn't automatically supported on release. If we have to modify the analysis to accept new templates, then we shift the burden to maintaining the analysis instead of the actual parsing. One is more complex than the other.
I can relate to this assessment. At first, the idea of an auto-analyzer/autoparser sounds plausible because the chat template is Turing-complete. However, I'm quite worried that the actual implementation contains too many heuristics. The code also has too many nested if branches. I acknowledge the efforts put into this PR, but we should probably reflect a bit more on the long-term advantages / disadvantages of this approach. On one side, since newer chat templates are based on older ones, it's likely that the autoparser can correctly support them given the right number of heuristics. But on the other side, as @aldehir mentioned, maintaining the autoparser can be challenging as it's more complex, and one small change can potentially affect multiple chat templates if not tested thoroughly. Just for reference, vLLM does maintain a list of per-model parsers, which is somewhat equivalent to the non-autoparser version in llama.cpp. Currently, our tool/thinking parser and message transformations are mixed together. I'm wondering if we should focus more on refactoring our definition of the chat system.
@aldehir @ngxson I understand your reservations, just wanted to put the following out there: the core reason I want to do this and not the "multiple parsers" approach is that I found that oftentimes, the "building blocks" of the various parsers have things that are also useful in other parsers, so you either (a) maintain a series of abstractions that you have to manually merge or (b) copy-paste stuff and risk forgetting certain workarounds / solutions that were implemented in one template and not in another. Remember there's also […].
Before I start writing my response, I just want to be clear that I do appreciate your thoughts and your dedication to the chat system, as well as your contributions to different parts of the project. Now, I have two main concerns:
1. It can be quite risky because we still don't yet know whether anyone will be interested in maintaining this. Ofc, if the entry barrier is lower (i.e. you find a way to refactor the code), then it is worth trying. But if not, from my experience working for a while on llama.cpp: between an older, less functional system with people to maintain it and a newer, more functional one with fewer people to maintain it, I think the older one will likely be preferable (I'm talking about the consensus among the other maintainers). Examples of that: linenoise.cpp, minja, llama-run, the old Web UI.
2. I'm a bit doubtful about this assessment. Let's consider a given template […]. Now, imagine we have a function […]. If […]. Without the autoparser, […]. Ofc, we can prevent such a thing from happening by having extensive tests for […].
I think this can be achieved with a simpler heuristic at the cost of not being exhaustive. The PEG parser lends itself well to this by being composable. Simply check for the existence of a tag, or pairs of tags, for each component (reasoning, content, tool call). At that point, you can compose a parser based off those tags. This seems like the gist of what you have here, but checking if it exists in the template might be good enough.
vLLM does something similar when parsing, although I think they run the parsers in multiple passes. I don't know if they try to detect the proper parser for each component. Like I said, it may not be exhaustive, but it could address many templates. And unlike the current generic parser, it could support tool calling formats such as Qwen3-Coder as well.
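Roughly, in the same pseudo-combinator style as the snippets above (not the real API), the composition could look like:

// Pseudo-code: probe the rendered template for known tag pairs, then only
// compose the pieces that actually exist in this template.
bool has_reasoning = rendered.find("<think>")     != std::string::npos;
bool has_tool_tags = rendered.find("<tool_call>") != std::string::npos;

auto reasoning = p.literal("<think>")     + p.tag("reasoning", p.until("</think>"))      + p.literal("</think>");
auto content   = p.tag("content", p.until(has_tool_tags ? "<tool_call>" : eot_marker));  // eot_marker = end-of-turn token
auto tool_call = p.literal("<tool_call>") + p.tag("tool-json",  p.until("</tool_call>")) + p.literal("</tool_call>");

auto parser = (has_reasoning ? p.optional(reasoning) : p.literal(""))
            + content
            + (has_tool_tags ? p.optional(tool_call) : p.literal(""));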
@ngxson Yeah, I understand the skepticism. But the way I see it, the parser code was already (a) unmaintained and (b) a complete mess; that's why I decided to refactor it. And this refactoring does focus on maintainability, although admittedly the entry level is a bit high. While I understand the will to be conservative, I feel like the approach in vLLM etc. just invites copy-pasting and bad programming practices. The regression-safety of separate templates that you describe is illusory if what everyone does is copy-paste fragments of already-existing parsers to meet their needs. Then, when a bug is encountered in one of the parsers, there is no way to automatically propagate the fix to the other parsers, and you're left with either (a) someone manually recognizing that it's the same pattern or (b) people reporting it for the other template. I'll give you an example: when doing agentic tests with the autoparser, I ran into an issue with parsing JSON arrays in constructed (quasi-XML) format. The problem is simple: the thing between the tags can be a string or a JSON array. If it's a JSON array, the parser should treat it as such - but it might as well be a code string. My first approach was to use a heuristic, but of course the proper approach is to look at the argument type as well. My takeaway is this: if I fix it now and add a test, it doubles as a test for all other parsers sharing this mechanism. But if this is done in a "parser-per-template" scenario, the person making the change needs to actually adjust all the parsers for similar templates. Of course, the autoparser approach does not technically limit regressions. But exactly because tests for one template double as tests for all templates with similar mechanisms, it is very hard to actually introduce a breaking change that does not fail […]. Nevertheless, I do realize that the current code still remains in its "vibe-coded" stage, although it was written under my supervision so I understand the logic behind it :) I will rewrite it to a more "human-maintainable" format; there was just a lot of stuff to do to adjust to current changes, so I haven't had time to do it. I do think it is similar to the changes to the WebUI where @allozaur rewrote it to a more complex, but also more advanced and functional version. But of course it's up to you to approve the architecture. Paging @ggerganov as well since he might want to add his view on this.
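To make the argument-type point above concrete, the fix amounts to something like this (a sketch, not the exact code in the branch):

// Sketch: decide how to treat the text between argument tags based on the declared
// parameter type from the tool's JSON schema, instead of guessing from the content.
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

static json parse_arg_value(const std::string & raw, const std::string & declared_type) {
    if (declared_type == "string") {
        return raw;                      // could be code that merely looks like JSON
    }
    // array / object / number / boolean: parse as JSON, but don't throw on bad input
    json parsed = json::parse(raw, /* cb = */ nullptr, /* allow_exceptions = */ false);
    if (!parsed.is_discarded()) {
        return parsed;
    }
    return raw;                          // fall back to the raw string if parsing fails
}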
@aldehir Yes, what you are describing is basically what I tried to do here, although it might not be clear from the code at this state :) What you call "barriers" is what I call "markers" in the analysis. I will try to refactor the code to make this approach more clear.
I see your point, but in theory it will be tricky to test all feature (or component) combinations. For example, if a template has features A, B and another has features B, C, then the tests must cover AB, BC, AC and potentially ABC. Although this is just from a very theoretical POV, I think it can be related to @aldehir's comment from earlier: the autoparser, instead of exhaustively detecting all possible features A, B, can maybe detect a larger D=A+B. From your example, the Qwen3 tool calling can potentially be a larger feature called […]. This will effectively make the implementation, especially […]. Ofc your current version is already doing that, but the number of features returned by […].
// For FUNC_PREFIXED_INDEXED format (e.g., Kimi-K2)
std::string per_call_start;       // e.g., "<|tool_call_begin|>"
std::string function_namespace;   // e.g., "functions." (prefix before function name)
std::string args_marker;          // e.g., "<|tool_call_argument_begin|>"
std::string per_call_end;         // e.g., "<|tool_call_end|>"

// For FUNC_BRACKET_TAG format (e.g., Mistral Small 3.2)
std::string id_marker;            // e.g., "[CALL_ID]" - marker before tool call ID

// For FUNC_MARKDOWN_CODE_BLOCK format (e.g., Cohere Command-R Plus)
std::string code_block_marker;    // e.g., "Action:" - text marker before code block
std::string code_block_language;  // e.g., "json" - language identifier in code fence
To continue my point above (please correct me if I'm wrong): it seems like this list of features is model-specific, and, for example, there is likely no chance that a future model will use FUNC_PREFIXED_INDEXED and FUNC_BRACKET_TAG at the same time.
So, I think these can be grouped into a larger feature, and since some tokens are practically unchanged, like <|tool_call_begin|>, we can hard-code them inside the parser code. That will make the code look more explicit.
This is a huge endeavor that I promised back when I applied for maintaining the parser code. The legacy parser code was hard to maintain and buggy, and supporting new models with it was really annoying. There was a worthwhile contribution by @hksdpc255 to add some XML tool-calling abstractions, but that was still just a patch on an open wound.
Thanks to @aldehir and his PEG parser, I managed to create an autoparser mechanism, using all the currently supported templates, their parsers and test cases as a base. The idea is simple: most models' syntax follows the general pattern of:

<reasoning_markers> <reasoning_content> <end_of_reasoning_markers> <content_markers> <main_content> <end_of_content_markers> <tool_call_markers> ( <json> | <function marker> <args json> | <function marker> <args marker> <value json> ) <end_of_tool_call_marker>

Of course, some elements might not be present in a given template, but that's the general structure. Since this is a pretty finite structure, it's possible to determine the relevant elements by differential analysis - similar to how Minja already does capability detection, but more fine-grained, because by comparing various template outputs, we get to actually extract the relevant markers.
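For illustration (not every element is present in every template - this one, for instance, has no explicit content markers), a Qwen-style rendering maps onto that structure roughly like so:

<think>                                                     <- reasoning_markers
...model reasoning...                                       <- reasoning_content
</think>                                                    <- end_of_reasoning_markers
...visible answer text...                                   <- main_content
<tool_call>                                                 <- tool_call_markers
{"name": "get_weather", "arguments": {"city": "Paris"}}     <- json
</tool_call>                                                <- end_of_tool_call_marker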
Some models will obviously not get handled so easily. However, in the course of implementing the mechanism, only two models ended up needing their own separate parsers: Ministral and GPT-OSS - the former not because of its complexity, but because of the need to rewrite the message structure passed to the template. GPT-OSS is a different beast since it supports arbitrarily many interleaved blocks, so it doesn't fit into the scheme I mentioned above (but its parser has been rewritten to PEG as well).
This is currently anchored on Minja and uses its capability detection, but since the differential analysis already does its own capability detection, I fully expect to throw that part out and base this on @ngxson 's #18462 instead.
Obsoletes #18353 (sorry @ochafik - I know you put a lot of work into that).
Old parsers, tests and all supporting code have been thrown out; templates got new PEG-parser-based test cases, and all of them now also test streaming behavior. I have tested this extensively on agentic coding (mostly with OpenCode) to ensure that this actually works (my wish to refactor the parser code was mostly caused by my prior experience with agentic coding on llama.cpp, which was extremely buggy with a lot of models; this is an attempt to remedy that). Hopefully, having one unified codebase with a largely reduced line count will make it easier to fix any potential errors.
This also means that there is no longer a need to provide support for new models' specific templates unless they have some odd constructs - they should be supported out of the box. There's a new tool called debug-template-parser that you can point to any Jinja template file or GGUF model with an embedded Jinja template and have it spit out the details of the generated autoparser + tool-calling grammar.

Oh, important note: all Minja polyfills have been disabled. Working templates are now required. While I see why, a year and a half ago, having proof-of-concept code that supported tool calling on models that didn't natively have tool calling might've been useful, right now supporting that is making it harder to properly support current and actually used models. Therefore, a functional template with tool calling is required if someone wants tool calling.
I want to ask everyone from the community who can to test this. I will keep this branch current with master. I tried to test this as much as I could, but I'm just one person doing this after work, so obviously my testing abilities were limited. I will keep this as a draft until I've gathered enough feedback and testing data.
To not clutter the main repository's issue tracker, please report bugs either (a) in this thread or (b) in my issue tracker https://github.com/pwilkin/llama.cpp/issues
AI DISCLOSURE: Gemini Pro 3, Flash 3, Opus 4.5 and GLM 4.7 would like to admit that a human element did at some points interfere in the coding process, being so bold as to even throw most of the code out at some point and demand it be rewritten from scratch. The human also tinkered with the code massively, removing a lot of our beautiful comments and some code fragments that they claimed were useless. They had no problems, however, in using us to do all the annoying marker arithmetic. Therefore, we disavow any claim to this code and cede the responsibility onto the human.