feat(tokenizer): detect tool call arguments format from chat template #1028

CatherineSue wants to merge 6 commits into main
Conversation
Add `ToolCallArgumentsFormat` enum (Dict vs String) that detects whether a chat template expects tool_call arguments as parsed dicts (using `|tojson`, `|items` filters) or as raw JSON strings (plain concatenation). DeepSeek V3.1/R1 templates use plain string concat for arguments, while Llama, Qwen, and most modern templates follow the Transformers convention of expecting dicts with `|tojson` rendering.

Detection runs once at tokenizer init alongside content format detection. `process_tool_call_arguments()` is now conditional: it only converts string→dict when the template expects Dict format.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…guments detection

Add two additional patterns to detect_arguments_format_from_text:
1. Direct dict iteration (Llama 4): `for param in tool_call.arguments`
2. Alias-then-method (GLM, MiniMax): `set _args = arguments` + `.items()`

These were false negatives found during an exhaustive audit of 25+ model chat templates across /raid/models, HuggingFace, and e2e test models.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…ates

The direct dict iteration check (Pattern 2) scanned backwards from "arguments" to the last newline, then checked whether "for " and " in " appeared anywhere in that prefix. On single-line templates like DeepSeek V3.1, the prefix spans the entire template and matches unrelated for-loops (e.g. "for tool in message['tool_calls']").

Fix: walk backwards from "arguments" over the dot-access prefix (e.g. "tool_call.") and verify " in " immediately precedes it, rather than checking the entire line.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…ring-concat only

Instead of detecting all possible dict patterns (`|tojson`, `|items`, direct iteration, alias + `.items()`) and defaulting to string, flip the logic: detect the string-concat pattern (`arguments` followed by `+`) and default to dict. Dict is the Transformers convention used by 22+ models (Llama, Qwen, GLM, MiniMax, Nemotron, etc.); only the DeepSeek V3/R1 family uses string concatenation. Detecting the minority case is simpler and less fragile.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
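The flipped heuristic can be sketched in std-only Rust. This is an illustrative, simplified stand-in for the PR's detector (the real code lives in `crates/tokenizer/src/chat_template.rs`), kept deliberately close to the scan quoted in the review below: find each `arguments`, skip closing syntax characters, and return `String` only when a `+` follows.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolCallArgumentsFormat {
    Dict,   // template expects parsed dict arguments (Transformers convention)
    String, // template expects raw JSON-string arguments (DeepSeek V3.1/R1 style)
}

fn detect_arguments_format(template: &str) -> ToolCallArgumentsFormat {
    let bytes = template.as_bytes();
    let mut pos = 0;
    // Scan every occurrence of "arguments" in the raw template text.
    while let Some(idx) = template[pos..].find("arguments") {
        let start = pos + idx + "arguments".len();
        let mut i = start;
        // Skip closing syntax chars (] ' " ) }) and whitespace after the access.
        while i < bytes.len()
            && matches!(
                bytes[i],
                b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
            )
        {
            i += 1;
        }
        // A following `+` is string concatenation on the arguments value.
        if i < bytes.len() && bytes[i] == b'+' {
            return ToolCallArgumentsFormat::String;
        }
        pos = start;
    }
    // Default: the majority dict convention (also covers templates with no tools).
    ToolCallArgumentsFormat::Dict
}
```

As the later review comments point out, a raw text scan like this can false-positive on prose or cross-expression boundaries; the PR eventually moves to AST-based detection with this scan kept only as a fallback.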
Remove test_args_format_string_concat_deepseek (covered by the real template snippet test) and test_args_format_empty_template (trivial early return).

Signed-off-by: Chang Su <chang.s.su@oracle.com>
📝 Walkthrough

Adds detection of tool-call argument formatting (Dict vs String) to chat templates, exposes it via a new Tokenizer trait method, implements the method across tokenizers, and conditions model-gateway preprocessing of tool-call arguments on the detected format.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Gateway as Model Gateway
    participant Tokenizer as Tokenizer
    participant Template as ChatTemplateState
    participant Processor as Argument Processor
    Client->>Gateway: process_chat_messages(messages, tokenizer)
    Gateway->>Tokenizer: tool_call_arguments_format()
    Tokenizer->>Template: arguments_format()
    Template-->>Tokenizer: ToolCallArgumentsFormat
    Tokenizer-->>Gateway: Format (Dict or String)
    alt Format is Dict
        Gateway->>Processor: process_tool_call_arguments(transformed_messages)
        Processor->>Processor: Parse string -> JSON
        Processor-->>Gateway: Processed messages
    else Format is String
        Gateway-->>Gateway: Skip argument parsing
    end
    Gateway-->>Client: Return processed messages
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 3 passed
```rust
///
/// Inspects the template text for how `arguments` is used:
/// - If a dict-consuming filter (`|tojson`, `|items`, `|dictsort`, `|tojson(...)`) is applied,
///   the template expects arguments as dicts → return `Dict`.
/// - If `arguments` appears in the template without such filters (plain string concatenation),
///   the template expects arguments as JSON strings → return `String`.
/// - If `arguments` does not appear at all (no tool support), default to `Dict`
```
🟡 Nit: This docstring describes a two-pronged detection (positive detection of |tojson/|items/|dictsort → Dict, absence → String), but the implementation in detect_arguments_format_from_text doesn't check for those filters at all. It only detects the + string-concat pattern and defaults to Dict for everything else.
The result is the same, but a reader of this doc would expect filter-matching logic that doesn't exist. Consider updating to match the actual strategy:
```diff
-///
-/// Inspects the template text for how `arguments` is used:
-/// - If a dict-consuming filter (`|tojson`, `|items`, `|dictsort`, `|tojson(...)`) is applied,
-///   the template expects arguments as dicts → return `Dict`.
-/// - If `arguments` appears in the template without such filters (plain string concatenation),
-///   the template expects arguments as JSON strings → return `String`.
-/// - If `arguments` does not appear at all (no tool support), default to `Dict`
+/// Inspects the template text for how `arguments` is used:
+/// - If `arguments` is followed by `+` (Jinja string concatenation, e.g. DeepSeek V3.1),
+///   the template expects arguments as JSON strings → return `String`.
+/// - Otherwise (including when `arguments` does not appear at all),
+///   default to `Dict` (the Transformers convention).
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec6a067c31
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
if i < bytes.len() && bytes[i] == b'+' {
    return ToolCallArgumentsFormat::String;
}
```
Detect unfiltered arguments access as string format
The new detector only returns String when an arguments access is followed by +, but many templates can still expect a JSON string without concatenation (for example {{ tool['function']['arguments'] }} or passing it to a string helper). In those cases this code classifies as Dict, so downstream message processing converts arguments into an object and the template receives the wrong type, which can change rendered prompt format or trigger runtime type errors in string-oriented template logic.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let start = pos + idx + "arguments".len();
    let mut i = start;
    // Skip closing syntax chars: ] ' " ) } and whitespace
    while i < bytes.len()
```
Restrict arguments scan to Jinja expressions, not raw text
This loop scans raw template text for any arguments substring and then checks for a following + without parsing Jinja context, so string literals/comments in prompt text can produce false String detection. A template that actually expects dict arguments (uses |tojson) but contains incidental arguments + ... text will skip process_tool_call_arguments, causing arguments to stay as JSON strings and be double-encoded in rendered output.
Code Review
This pull request introduces a mechanism to detect the expected format of tool call arguments in Jinja2 chat templates, specifically distinguishing between parsed dictionaries (the Transformers standard) and raw JSON strings (used by DeepSeek V3/R1). The changes include a new ToolCallArgumentsFormat enum, heuristic-based detection logic in chat_template.rs, and updates to the Tokenizer trait and model gateway to conditionally handle argument transformation. Feedback focuses on improving the robustness of the string-scanning heuristic by adding word boundary checks for the 'arguments' keyword and correcting the documentation to accurately reflect the implemented detection logic.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let start = pos + idx + "arguments".len();
    let mut i = start;
    // Skip closing syntax chars: ] ' " ) } and whitespace
    while i < bytes.len()
        && matches!(
            bytes[i],
            b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
        )
    {
        i += 1;
    }
    if i < bytes.len() && bytes[i] == b'+' {
        return ToolCallArgumentsFormat::String;
    }
    pos = start;
}
```
The string scan for the arguments keyword lacks a word boundary check for the preceding character. This could lead to false positives if the template contains variables with names like my_arguments or tool_arguments that happen to be followed by a + operator.
Adding a check to ensure the character immediately preceding the match is not alphanumeric or an underscore would make the detection more robust.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let match_pos = pos + idx;
    let start = match_pos + "arguments".len();
    // Ensure "arguments" is a whole word by checking the preceding character
    let is_word_boundary = match_pos == 0 || {
        let prev = bytes[match_pos - 1];
        !prev.is_ascii_alphanumeric() && prev != b'_'
    };
    if is_word_boundary {
        let mut i = start;
        // Skip closing syntax chars: ] ' " ) } and whitespace
        while i < bytes.len()
            && matches!(
                bytes[i],
                b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
            )
        {
            i += 1;
        }
        if i < bytes.len() && bytes[i] == b'+' {
            return ToolCallArgumentsFormat::String;
        }
    }
    pos = start;
}
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokenizer/src/chat_template.rs`:
- Around line 58-67: Update the doc comment for
detect_tool_call_arguments_format to match the implemented heuristic: explain
that the function returns ToolCallArgumentsFormat::String only when a specific
concatenation pattern of `arguments` is detected in the template, otherwise it
returns ToolCallArgumentsFormat::Dict (including the case where `arguments` is
present without concat filters and the case where `arguments` is absent).
Mention the concat-detection behavior used by detect_tool_call_arguments_format
so future readers/tests understand the implemented decision rule.
- Around line 89-116: The detector detect_arguments_format_from_text is matching
any "arguments" token and can misclassify non-tool text as String; restrict
matches to accessor forms by, after finding "arguments", scanning backward from
the match start to the previous non-whitespace byte and require that that byte
is an accessor indicator (e.g., b'.', b']', b'\'' or b'"') and/or that the token
is enclosed in bracket/quote accessors (e.g., ["arguments"] or ['arguments'] or
.arguments) before applying the existing forward-scan for the '+'; update the
loop that finds "arguments" to perform this backward check (referencing
detect_arguments_format_from_text and preserve fallback to
ToolCallArgumentsFormat::Dict and the downstream process_tool_call_arguments
behavior).
ℹ️ Review info

Review profile: ASSERTIVE

📒 Files selected for processing (7)
- crates/tokenizer/src/cache/mod.rs
- crates/tokenizer/src/chat_template.rs
- crates/tokenizer/src/huggingface.rs
- crates/tokenizer/src/tiktoken.rs
- crates/tokenizer/src/traits.rs
- model_gateway/src/routers/grpc/utils/chat_utils.rs
- model_gateway/src/routers/grpc/utils/message_utils.rs
Replace the string-scan heuristic with AST-based detection integrated into the existing Detector. The AST walk looks for BinOp(Add|Concat) nodes where either operand accesses "arguments"; this is precise and immune to false positives from prose text, comments, or cross-expression boundaries. Falls back to a minimal string scan when AST parsing fails (e.g. incomplete template snippets).

Addresses reviewer feedback about word boundaries, expression-boundary crossing, and non-tool "arguments" text in templates.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
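The AST-walk idea can be illustrated without minijinja's feature-gated machinery API. The sketch below uses a tiny invented expression type (not minijinja's `Expr`) to show the two-part recursion the commit describes: a concat node counts as string-format evidence only when one side reaches an `arguments` access, and serialization filters like `tojson`/`items` stop the recursion (the guard a later review comment suggests).

```rust
// Invented minimal AST for illustration; the real code walks minijinja's AST.
#[derive(Debug)]
enum Expr {
    Var(String),                  // e.g. `arguments`
    GetItem(Box<Expr>, String),   // e.g. tool['function']['arguments']
    Concat(Box<Expr>, Box<Expr>), // Jinja `+` or `~`
    Filter(String, Box<Expr>),    // e.g. `|tojson`
    Str(String),                  // string literal
}

/// Does any sub-expression reach an `arguments` access?
/// Dict-consuming filters (`tojson`, `items`) stop the recursion,
/// since values they wrap are already serialized for dict rendering.
fn accesses_arguments(e: &Expr) -> bool {
    match e {
        Expr::Var(name) => name.as_str() == "arguments",
        Expr::GetItem(inner, key) => key.as_str() == "arguments" || accesses_arguments(inner),
        Expr::Concat(l, r) => accesses_arguments(l) || accesses_arguments(r),
        Expr::Filter(name, _) if name.as_str() == "tojson" || name.as_str() == "items" => false,
        Expr::Filter(_, inner) => accesses_arguments(inner),
        Expr::Str(_) => false,
    }
}

/// A concat node is string-format evidence when either operand
/// reaches an unfiltered `arguments` access.
fn is_args_concat(e: &Expr) -> bool {
    match e {
        Expr::Concat(l, r) => accesses_arguments(l) || accesses_arguments(r),
        _ => false,
    }
}
```

With this shape, `tool['arguments'] + '\n'` is flagged as String while `tool['arguments']|tojson ~ '\n'` stays Dict, matching the behavior the reviewers converge on below.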
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b610b28a20
```rust
Stmt::EmitExpr(e) => {
    self.inspect_expr_for_structure(&e.expr);
    if !self.args_concat {
        self.check_expr_for_args_concat(&e.expr);
    }
```
Scan set expressions for arguments concatenation
The detector only calls check_expr_for_args_concat in Stmt::EmitExpr, so concatenation inside {% set ... %} is never seen. A template like {% set rendered = tool['function']['arguments'] + '\n' %}{{ rendered }} will be classified as Dict, and process_tool_call_arguments will convert arguments to an object, causing a runtime type error when the template executes the + on a dict.
```rust
    Expr::Const(_) => {}
    _ => {}
}
```
Recurse through wrapped expressions in concat scan
check_expr_for_args_concat stops on any non-BinOp node, so it misses concatenation when a binop is wrapped (for example {{ (tool['function']['arguments'] + '\n')|trim }} where the root is a filter). Those templates are misdetected as Dict, which again leads to dict/string concat failures during rendering.
````rust
#[test]
fn test_args_format_string_concat_fallback() {
    // Incomplete template snippet that fails AST parse → exercises fallback
    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}";
````
🟡 Nit: This template is syntactically valid Jinja2 (a top-level for loop is fine — message being undefined is a runtime error, not a parse error), so parse() will succeed and the AST path handles this, not the fallback.
This means detect_arguments_format_fallback has zero test coverage. If you want to actually exercise the fallback, you'd need a template with genuinely broken syntax, e.g.:
````diff
-    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}";
+    // Template with broken syntax that fails AST parse → exercises fallback
+    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}{%";
````
(Note the trailing {% which makes the template unparseable.)
```rust
    {
        i += 1;
    }
    if i < bytes.len() && bytes[i] == b'+' {
```
🟡 Nit: The AST path checks for both + and ~ (Jinja's concat operator) via BinOpKind::Add | BinOpKind::Concat, but this fallback only checks for +. If a template uses ~ for string concatenation (e.g. tool.arguments ~ '\n') and also fails AST parsing, this would incorrectly return Dict.
Very unlikely in practice (DeepSeek uses +, and AST parse failures are rare), but worth noting the inconsistency. A quick fix if desired:
```rust
if i < bytes.len() && (bytes[i] == b'+' || bytes[i] == b'~') {
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokenizer/src/chat_template.rs`:
- Around line 284-288: The args-concat detector only runs in the Stmt::EmitExpr
branch and only descends BinOp, so wrapped/filter/assigned uses of
tool['function']['arguments'] are missed; update check_expr_for_args_concat to
recursively handle Expr::Filter, Expr::GetAttr, Expr::GetItem, Expr::Test, and
Expr::UnaryOp following the traversal pattern used by
inspect_expr_for_structure, and call check_expr_for_args_concat from non-emit
statements where expressions may assign or wrap values (e.g., in Set/Assign or
when inspecting a rendered variable) so patterns like
"(tool['function']['arguments'] ~ '\n')|trim" and "{% set rendered =
tool['function']['arguments'] ~ '\n' %}{{ rendered }}" are detected as String.
- Around line 206-229: tree_accesses_arguments currently descends into every
Expr::Filter and will treat values wrapped by serialization filters (e.g.,
tojson or items) as accessing "arguments", causing dict-style expressions to be
misclassified; modify tree_accesses_arguments so that when matching Expr::Filter
it first checks the filter name and returns false (skip recursion) for filters
"tojson" and "items" (or their equivalent identifiers) and only otherwise
recurses into f.expr, and add a regression test exercising the concatenated
tojson case like the example in the comment to ensure dict rendering is
preserved; update any related behavior in is_args_concat if needed to rely on
the corrected tree_accesses_arguments.
ℹ️ Review info

Review profile: ASSERTIVE

📒 Files selected for processing (1)
- crates/tokenizer/src/chat_template.rs
```rust
/// Recursively check if any sub-expression in a tree accesses "arguments"
fn tree_accesses_arguments(expr: &Expr) -> bool {
    if Self::accesses_arguments(expr) {
        return true;
    }
    match expr {
        Expr::BinOp(op) => {
            Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right)
        }
        Expr::Filter(f) => f
            .expr
            .as_ref()
            .is_some_and(|e| Self::tree_accesses_arguments(e)),
        Expr::GetAttr(g) => Self::tree_accesses_arguments(&g.expr),
        Expr::GetItem(g) => Self::tree_accesses_arguments(&g.expr),
        _ => false,
    }
}

/// Check if a BinOp uses `+` or `~` with an `arguments` access on either side
fn is_args_concat(op: &minijinja::machinery::ast::BinOp) -> bool {
    matches!(op.op, BinOpKind::Add | BinOpKind::Concat)
        && (Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right))
}
```
Add regression test for concatenated |tojson and exclude dict-oriented filters from raw-argument detection.
The tree_accesses_arguments() function walks into all filters, so a dict-style expression like {{ '{"arguments": ' ~ tool['function']['arguments']|tojson ~ '}' }} is incorrectly classified as String because the recursion finds arguments inside the tojson filter. This skips downstream Dict processing. Exclude tojson and items filters (which explicitly serialize for Dict rendering) by adding a guard clause, and add a regression test for the concatenated |tojson case.
Suggested fix:

```diff
 /// Recursively check if any sub-expression in a tree accesses "arguments"
 fn tree_accesses_arguments(expr: &Expr) -> bool {
     if Self::accesses_arguments(expr) {
         return true;
     }
     match expr {
+        Expr::Filter(f) if matches!(f.name, "tojson" | "items") => false,
         Expr::BinOp(op) => {
             Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right)
         }
         Expr::Filter(f) => f
             .expr
             .as_ref()
             .is_some_and(|e| Self::tree_accesses_arguments(e)),
         Expr::GetAttr(g) => Self::tree_accesses_arguments(&g.expr),
         Expr::GetItem(g) => Self::tree_accesses_arguments(&g.expr),
         _ => false,
     }
 }
```
```rust
Stmt::EmitExpr(e) => {
    self.inspect_expr_for_structure(&e.expr);
    if !self.args_concat {
        self.check_expr_for_args_concat(&e.expr);
    }
```
Handle arguments concatenation in filtered and wrapped expressions.
The args-concat detector only runs from Stmt::EmitExpr and only descends through Expr::BinOp, leaving wrapper expressions unchecked. Templates like {{ (tool['function']['arguments'] ~ '\n')|trim }} and {% set rendered = tool['function']['arguments'] ~ '\n' %}{{ rendered }} will still be misclassified as Dict. Both patterns are valid Jinja2 and expect raw strings, causing the same type mismatch that blocks DeepSeek V3.1 templates.
Extend check_expr_for_args_concat() to recurse through Filter, GetAttr, GetItem, Test, and UnaryOp (following the pattern already used in inspect_expr_for_structure()), and also call it from non-emit statements when arguments may be assigned or wrapped.
🧪 Regression cases

```rust
assert_eq!(
    detect_tool_call_arguments_format(
        "{{ (tool['function']['arguments'] ~ '\\n')|trim }}"
    ),
    ToolCallArgumentsFormat::String,
);
assert_eq!(
    detect_tool_call_arguments_format(
        "{% set rendered = tool['function']['arguments'] ~ '\\n' %}{{ rendered }}"
    ),
    ToolCallArgumentsFormat::String,
);
```
Hi @CatherineSue, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description
Problem
`process_tool_call_arguments()` unconditionally converts tool_call arguments from JSON string to dict before passing them to the chat template. This follows the Transformers/HuggingFace guidance that templates should expect dict arguments. However, DeepSeek V3/R1 family templates use plain string concatenation for arguments (e.g. `tool['function']['arguments'] + '\n'`), which crashes when given a dict.

vLLM has the same issue and works around it by shipping override templates that add `|tojson`. This requires users to manually pass `--chat-template`.

Solution
Detect the expected arguments format from the chat template at tokenizer init time and conditionally apply `process_tool_call_arguments()`.

An exhaustive audit of 25+ model chat templates across /raid/models, HuggingFace, and e2e test models found:

Since dict is the overwhelming majority, the detection is flipped: detect the string-concat pattern (`arguments` followed by `+`) and default to dict. This is a ~20-line O(n) string scan; no regex or AST parsing needed.

Changes
- `crates/tokenizer/src/chat_template.rs`: Add `ToolCallArgumentsFormat` enum (Dict | String) and a `detect_tool_call_arguments_format()` detection function. Integrate into `ChatTemplateState` alongside existing content format detection.
- `crates/tokenizer/src/traits.rs`: Add `tool_call_arguments_format()` to the `Tokenizer` trait.
- `crates/tokenizer/src/huggingface.rs`, `tiktoken.rs`, `cache/mod.rs`: Implement the new trait method.
- `model_gateway/src/routers/grpc/utils/chat_utils.rs`: Make `process_tool_call_arguments()` conditional on `ToolCallArgumentsFormat::Dict`.
- `model_gateway/src/routers/grpc/utils/message_utils.rs`: Same conditional for the Messages API path.
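The gateway-side change reduces to a small conditional. A hypothetical sketch, using the trait method and enum names from this PR but with an invented stand-in trait and stub tokenizer (the real trait and gateway code are more involved):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolCallArgumentsFormat {
    Dict,
    String,
}

// Invented stand-in for the Tokenizer trait method this PR adds.
trait ArgumentsFormatSource {
    fn tool_call_arguments_format(&self) -> ToolCallArgumentsFormat;
}

// Stub tokenizer that reports a fixed, pre-detected format.
struct StubTokenizer(ToolCallArgumentsFormat);

impl ArgumentsFormatSource for StubTokenizer {
    fn tool_call_arguments_format(&self) -> ToolCallArgumentsFormat {
        self.0
    }
}

/// Decide per request whether to run the string→dict conversion,
/// mirroring the conditional described for chat_utils.rs / message_utils.rs.
fn should_process_tool_call_arguments(tok: &dyn ArgumentsFormatSource) -> bool {
    tok.tool_call_arguments_format() == ToolCallArgumentsFormat::Dict
}
```

Because detection runs once at tokenizer init, this check is a cheap enum comparison on the request path.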
Test Plan

- Verified detection against templates from /raid/models and HuggingFace (DeepSeek-V3-0324, V3.1, R1-0528, R1-Distill-Qwen-7B, R1-Distill-Qwen-32B all correctly detected as String).

Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit
New Features
Improvements
Tests