UPSTREAM PR #18353: [WIP] tool-call: experimental migration of all parsers to peg-parser infra (w/ better test coverage) #692
Mirrored from ggml-org/llama.cpp#18353
TL;DR: it's a lot, but there's a lot more testing than before.
Building on the PEG parser infrastructure introduced in #17136 by @aldehir, this is an experimental effort to migrate all chat template formats to the unified PEG approach.
Why migrate? The current monolithic common/chat.cpp has grown to ~25 ad-hoc parser implementations that are difficult to maintain. Many parsing bugs are hard to reproduce and diagnose (esp. if the user wasn't in --verbose mode).

The PEG infrastructure offers a cleaner path forward, w/ strong guarantees (modulo bugs) that what is allowed to be generated should be parseable.
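To illustrate why a single definition can back both generation and parsing, here is a minimal sketch of the idea in Python. It is not the llama.cpp PEG API: the classes Lit, Seq and Choice and the tool names are hypothetical, and the emitted string is only GBNF-style. Each node knows how to parse input and how to print the grammar rule that would constrain generation, so the two cannot drift apart.

```python
from dataclasses import dataclass

# Hypothetical mini-combinators (NOT the llama.cpp PEG builder API).
# Each node can both parse text and emit a GBNF-style grammar fragment.

@dataclass
class Lit:
    text: str

    def parse(self, s, pos):
        # Return the position after the literal, or None on mismatch.
        return pos + len(self.text) if s.startswith(self.text, pos) else None

    def grammar(self):
        escaped = self.text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

@dataclass
class Seq:
    parts: tuple

    def parse(self, s, pos):
        for part in self.parts:
            pos = part.parse(s, pos)
            if pos is None:
                return None
        return pos

    def grammar(self):
        return "(" + " ".join(p.grammar() for p in self.parts) + ")"

@dataclass
class Choice:
    options: tuple

    def parse(self, s, pos):
        # PEG ordered choice: the first alternative that matches wins.
        for option in self.options:
            end = option.parse(s, pos)
            if end is not None:
                return end
        return None

    def grammar(self):
        return "(" + " | ".join(o.grammar() for o in self.options) + ")"

# Toy tool-call shape with two made-up tool names.
tool_call = Seq((
    Lit("<tool_call>"),
    Choice((Lit("get_time"), Lit("get_weather"))),
    Lit("</tool_call>"),
))

print("root ::= " + tool_call.grammar())
```

One caveat this sketch makes visible: PEG choice is ordered while grammar alternation is not, so if one alternative is a prefix of another the grammar can admit text the parser rejects. That is the kind of gap the "modulo bugs" qualifier covers.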
How to Test
Changes:
- common/chat-parsers/*.cpp - 28 modular parser implementations
- --experimental-new-parsers - defaults to off, nothing changes by default

New "Needle" Streaming Tests
Existing streaming tests (tools/server/tests/unit/test_tool_call.py) require loading real models and cover only a subset of formats. This PR adds systematic coverage for all 21 formats without the model-loading overhead.

This migration was designed to be safe through systematic test constraints:
21 formats x 6+ scenarios = up to 126 regression tests (some scenarios filtered based on format capabilities)
Each format tests:
How Needle Tests Work
The "needle" technique injects unique marker pairs into each semantic field. For example, in Hermes 2 Pro format with thinking and a tool call:
The test parses this message at every character boundary (simulating streaming), and verifies:
This aims to prove parsers are truly incremental: partial input produces partial output, fields stream in proper order, and nothing is buffered unnecessarily.
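The needle technique can be sketched in a few lines of Python. Everything here is an assumption for illustration: the marker strings, the simplified Hermes-like layout, and the toy parse_partial function stand in for the real templates and PEG parsers, which are not reproduced in this description. The harness parses every prefix of the message and checks that output only grows, that needles surface in field order, and that each needle is visible as soon as its last character has arrived.

```python
# Hypothetical needle markers, one (first, second) pair per semantic field.
NEEDLES = {
    "reasoning": ("$N1R$", "$N2R$"),
    "content": ("$N1C$", "$N2C$"),
    "tool_args": ("$N1A$", "$N2A$"),
}

# Toy message in a simplified Hermes-like layout (not the exact template).
MESSAGE = (
    "<think>$N1R$ plan $N2R$</think>"
    "$N1C$ text $N2C$"
    '<tool_call>{"x": "$N1A$ v $N2A$"}</tool_call>'
)

def hold_back(s, tag):
    """Drop a trailing fragment of s that could be the start of tag."""
    for k in range(min(len(tag) - 1, len(s)), 0, -1):
        if s.endswith(tag[:k]):
            return s[:-k]
    return s

def parse_partial(text):
    """Toy incremental parser: returns whatever is known so far per field."""
    fields = {"reasoning": "", "content": "", "tool_args": ""}
    rest = text
    if rest.startswith("<think>"):
        rest = rest[len("<think>"):]
        end = rest.find("</think>")
        if end == -1:
            fields["reasoning"] = hold_back(rest, "</think>")
            return fields
        fields["reasoning"] = rest[:end]
        rest = rest[end + len("</think>"):]
    elif "<think>".startswith(rest):
        return fields  # empty input or an incomplete opening tag
    start = rest.find("<tool_call>")
    if start == -1:
        fields["content"] = hold_back(rest, "<tool_call>")
        return fields
    fields["content"] = rest[:start]
    rest = rest[start + len("<tool_call>"):]
    end = rest.find("</tool_call>")
    fields["tool_args"] = rest[:end] if end != -1 else hold_back(rest, "</tool_call>")
    return fields

def run_needle_test(message, parse):
    """Parse every prefix; return the prefix length at which each needle first appears."""
    final = parse(message)
    first_seen = {}
    for i in range(len(message) + 1):
        for field, value in parse(message[:i]).items():
            # Streamed output may only grow: a partial value is a prefix of the final one.
            assert final[field].startswith(value), (field, i)
            for needle in NEEDLES[field]:
                if needle in value:
                    first_seen.setdefault((field, needle), i)
    return first_seen

def in_field_order(first_seen):
    """True if needles appeared strictly in reasoning, content, tool_args order."""
    order = [first_seen[(f, n)] for f in NEEDLES for n in NEEDLES[f]]
    return order == sorted(order) and len(set(order)) == len(order)

seen = run_needle_test(MESSAGE, parse_partial)
assert in_field_order(seen)
# No unnecessary buffering: each needle shows up the moment its last character arrives.
assert all(i == MESSAGE.index(n) + len(n) for (_, n), i in seen.items())
```

The only text the toy parser withholds is a trailing fragment that might turn out to be a closing tag, which is the minimum buffering needed to avoid leaking markup into a field.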
Known Limitations
The PEG implementation has gaps vs legacy (TBC):
- allOf/anyOf/$ref patterns not fully handled
- until_max w/ weird implementation (maybe we just drop maxLength on xml formats)

Proposed Migration Plan
- --experimental-new-parsers
- common/chat-parser.cpp: ~28 legacy parser functions (~900 lines)
- common/chat.cpp: ~19 legacy init functions (~600 lines)
- common/chat-peg-parser.cpp/.h: class-based builders/mappers (~220 lines)
- common/chat-parser-xml-toolcall.cpp/.h: XML grammar builder (~900 lines) - new PEG parsers generate grammars directly from their parser definitions

Follow up work
- supports_tool_call_id - Whether tool calls include IDs
- reasoning_requires_tools - Whether thinking mode only works with tools
- tools_emit_content_with_calls - Whether tool calls can include content
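
As a rough illustration of how capability flags like these could drive the scenario filtering mentioned for the needle tests, here is a small Python sketch. The three flag names come from the list above; the FormatCaps container, the scenario names, and the filtering rules are hypothetical and only show the shape of the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatCaps:
    # Flag names taken from the follow-up list; the container itself is hypothetical.
    supports_tool_call_id: bool = False
    reasoning_requires_tools: bool = False
    tools_emit_content_with_calls: bool = False

# Hypothetical scenario names, each paired with the capability check it needs.
SCENARIOS = {
    "content_only": lambda caps: True,
    "tool_call": lambda caps: True,
    "tool_call_with_id": lambda caps: caps.supports_tool_call_id,
    "reasoning_without_tools": lambda caps: not caps.reasoning_requires_tools,
    "content_then_tool_call": lambda caps: caps.tools_emit_content_with_calls,
}

def applicable_scenarios(caps):
    """Return the scenarios that make sense for a format with these capabilities."""
    return [name for name, allowed in SCENARIOS.items() if allowed(caps)]

print(applicable_scenarios(FormatCaps(supports_tool_call_id=True)))
```

Keeping the rules in one table means a format with fewer capabilities simply runs fewer scenarios, which matches the "up to 126" count rather than a fixed total.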