feat(tokenizer): detect tool call arguments format from chat template #1028

CatherineSue wants to merge 6 commits into main
Conversation
Add `ToolCallArgumentsFormat` enum (Dict vs String) that detects whether a chat template expects tool_call arguments as parsed dicts (using `|tojson`, `|items` filters) or as raw JSON strings (plain concatenation). DeepSeek V3.1/R1 templates use plain string concat for arguments, while Llama, Qwen, and most modern templates follow the Transformers convention of expecting dicts with `|tojson` rendering.

Detection runs once at tokenizer init alongside content format detection. `process_tool_call_arguments()` is now conditional: it only converts string→dict when the template expects Dict format.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…guments detection

Add two additional patterns to detect_arguments_format_from_text:
1. Direct dict iteration (Llama 4): `for param in tool_call.arguments`
2. Alias-then-method (GLM, MiniMax): `set _args = arguments` + `.items()`

These were false negatives found during an exhaustive audit of 25+ model chat templates across /raid/models, HuggingFace, and e2e test models.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…ates

The direct dict iteration check (Pattern 2) scanned backwards from "arguments" to the last newline, then checked whether "for " and " in " appeared anywhere in that prefix. On single-line templates like DeepSeek V3.1, the prefix spans the entire template and matches unrelated for-loops (e.g. "for tool in message['tool_calls']").

Fix: walk backwards from "arguments" over the dot-access prefix (e.g. "tool_call.") and verify " in " immediately precedes it, rather than checking the entire line.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
…ring-concat only

Instead of detecting all possible dict patterns (`|tojson`, `|items`, direct iteration, alias + `.items()`) and defaulting to string, flip the logic: detect the string-concat pattern (`arguments` followed by `+`) and default to dict. Dict is the Transformers convention used by 22+ models (Llama, Qwen, GLM, MiniMax, Nemotron, etc.); only the DeepSeek V3/R1 family uses string concatenation. Detecting the minority case is simpler and less fragile.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
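The flipped heuristic can be sketched in std-only Rust. This is an illustrative, simplified stand-in for the PR's detector (the real code lives in `crates/tokenizer/src/chat_template.rs`), kept deliberately close to the scan quoted in the review below: find each `arguments`, skip closing syntax characters, and return `String` only when a `+` follows.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolCallArgumentsFormat {
    Dict,   // template expects parsed dict arguments (Transformers convention)
    String, // template expects raw JSON-string arguments (DeepSeek V3.1/R1 style)
}

fn detect_arguments_format(template: &str) -> ToolCallArgumentsFormat {
    let bytes = template.as_bytes();
    let mut pos = 0;
    // Scan every occurrence of "arguments" in the raw template text.
    while let Some(idx) = template[pos..].find("arguments") {
        let start = pos + idx + "arguments".len();
        let mut i = start;
        // Skip closing syntax chars (] ' " ) }) and whitespace after the access.
        while i < bytes.len()
            && matches!(
                bytes[i],
                b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
            )
        {
            i += 1;
        }
        // A following `+` is string concatenation on the arguments value.
        if i < bytes.len() && bytes[i] == b'+' {
            return ToolCallArgumentsFormat::String;
        }
        pos = start;
    }
    // Default: the majority dict convention (also covers templates with no tools).
    ToolCallArgumentsFormat::Dict
}
```

As the later review comments point out, a raw text scan like this can false-positive on prose or cross-expression boundaries; the PR eventually moves to AST-based detection with this scan kept only as a fallback.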
Remove test_args_format_string_concat_deepseek (covered by the real template snippet test) and test_args_format_empty_template (trivial early return).

Signed-off-by: Chang Su <chang.s.su@oracle.com>
📝 Walkthrough

Adds detection of tool-call argument formatting (Dict vs String) to chat templates, exposes it via a new Tokenizer trait method, implements the method across tokenizers, and conditions model-gateway preprocessing of tool-call arguments on the detected format.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Gateway as Model Gateway
    participant Tokenizer as Tokenizer
    participant Template as ChatTemplateState
    participant Processor as Argument Processor
    Client->>Gateway: process_chat_messages(messages, tokenizer)
    Gateway->>Tokenizer: tool_call_arguments_format()
    Tokenizer->>Template: arguments_format()
    Template-->>Tokenizer: ToolCallArgumentsFormat
    Tokenizer-->>Gateway: Format (Dict or String)
    alt Format is Dict
        Gateway->>Processor: process_tool_call_arguments(transformed_messages)
        Processor->>Processor: Parse string -> JSON
        Processor-->>Gateway: Processed messages
    else Format is String
        Gateway-->>Gateway: Skip argument parsing
    end
    Gateway-->>Client: Return processed messages
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 3 passed
```rust
///
/// Inspects the template text for how `arguments` is used:
/// - If a dict-consuming filter (`|tojson`, `|items`, `|dictsort`, `|tojson(...)`) is applied,
///   the template expects arguments as dicts → return `Dict`.
/// - If `arguments` appears in the template without such filters (plain string concatenation),
///   the template expects arguments as JSON strings → return `String`.
/// - If `arguments` does not appear at all (no tool support), default to `Dict`
```
🟡 Nit: This docstring describes a two-pronged detection (positive detection of |tojson/|items/|dictsort → Dict, absence → String), but the implementation in detect_arguments_format_from_text doesn't check for those filters at all. It only detects the + string-concat pattern and defaults to Dict for everything else.
The result is the same, but a reader of this doc would expect filter-matching logic that doesn't exist. Consider updating to match the actual strategy:
```diff
-///
-/// Inspects the template text for how `arguments` is used:
-/// - If a dict-consuming filter (`|tojson`, `|items`, `|dictsort`, `|tojson(...)`) is applied,
-///   the template expects arguments as dicts → return `Dict`.
-/// - If `arguments` appears in the template without such filters (plain string concatenation),
-///   the template expects arguments as JSON strings → return `String`.
-/// - If `arguments` does not appear at all (no tool support), default to `Dict`
+/// Inspects the template text for how `arguments` is used:
+/// - If `arguments` is followed by `+` (Jinja string concatenation, e.g. DeepSeek V3.1),
+///   the template expects arguments as JSON strings → return `String`.
+/// - Otherwise (including when `arguments` does not appear at all),
+///   default to `Dict` (the Transformers convention).
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec6a067c31
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
if i < bytes.len() && bytes[i] == b'+' {
    return ToolCallArgumentsFormat::String;
}
```
Detect unfiltered arguments access as string format
The new detector only returns String when an arguments access is followed by +, but many templates can still expect a JSON string without concatenation (for example {{ tool['function']['arguments'] }} or passing it to a string helper). In those cases this code classifies as Dict, so downstream message processing converts arguments into an object and the template receives the wrong type, which can change rendered prompt format or trigger runtime type errors in string-oriented template logic.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let start = pos + idx + "arguments".len();
    let mut i = start;
    // Skip closing syntax chars: ] ' " ) } and whitespace
    while i < bytes.len()
```
Restrict arguments scan to Jinja expressions, not raw text
This loop scans raw template text for any arguments substring and then checks for a following + without parsing Jinja context, so string literals/comments in prompt text can produce false String detection. A template that actually expects dict arguments (uses |tojson) but contains incidental arguments + ... text will skip process_tool_call_arguments, causing arguments to stay as JSON strings and be double-encoded in rendered output.
Code Review
This pull request introduces a mechanism to detect the expected format of tool call arguments in Jinja2 chat templates, specifically distinguishing between parsed dictionaries (the Transformers standard) and raw JSON strings (used by DeepSeek V3/R1). The changes include a new ToolCallArgumentsFormat enum, heuristic-based detection logic in chat_template.rs, and updates to the Tokenizer trait and model gateway to conditionally handle argument transformation. Feedback focuses on improving the robustness of the string-scanning heuristic by adding word boundary checks for the 'arguments' keyword and correcting the documentation to accurately reflect the implemented detection logic.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let start = pos + idx + "arguments".len();
    let mut i = start;
    // Skip closing syntax chars: ] ' " ) } and whitespace
    while i < bytes.len()
        && matches!(
            bytes[i],
            b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
        )
    {
        i += 1;
    }
    if i < bytes.len() && bytes[i] == b'+' {
        return ToolCallArgumentsFormat::String;
    }
    pos = start;
}
```
The string scan for the arguments keyword lacks a word boundary check for the preceding character. This could lead to false positives if the template contains variables with names like my_arguments or tool_arguments that happen to be followed by a + operator.
Adding a check to ensure the character immediately preceding the match is not alphanumeric or an underscore would make the detection more robust.
```rust
while let Some(idx) = template[pos..].find("arguments") {
    let match_pos = pos + idx;
    let start = match_pos + "arguments".len();
    // Ensure "arguments" is a whole word by checking the preceding character
    let is_word_boundary = match_pos == 0 || {
        let prev = bytes[match_pos - 1];
        !prev.is_ascii_alphanumeric() && prev != b'_'
    };
    if is_word_boundary {
        let mut i = start;
        // Skip closing syntax chars: ] ' " ) } and whitespace
        while i < bytes.len()
            && matches!(
                bytes[i],
                b']' | b'\'' | b'"' | b')' | b'}' | b' ' | b'\t' | b'\n' | b'\r'
            )
        {
            i += 1;
        }
        if i < bytes.len() && bytes[i] == b'+' {
            return ToolCallArgumentsFormat::String;
        }
    }
    pos = start;
}
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokenizer/src/chat_template.rs`:
- Around line 58-67: Update the doc comment for
detect_tool_call_arguments_format to match the implemented heuristic: explain
that the function returns ToolCallArgumentsFormat::String only when a specific
concatenation pattern of `arguments` is detected in the template, otherwise it
returns ToolCallArgumentsFormat::Dict (including the case where `arguments` is
present without concat filters and the case where `arguments` is absent).
Mention the concat-detection behavior used by detect_tool_call_arguments_format
so future readers/tests understand the implemented decision rule.
- Around line 89-116: The detector detect_arguments_format_from_text is matching
any "arguments" token and can misclassify non-tool text as String; restrict
matches to accessor forms by, after finding "arguments", scanning backward from
the match start to the previous non-whitespace byte and require that that byte
is an accessor indicator (e.g., b'.', b']', b'\'' or b'"') and/or that the token
is enclosed in bracket/quote accessors (e.g., ["arguments"] or ['arguments'] or
.arguments) before applying the existing forward-scan for the '+'; update the
loop that finds "arguments" to perform this backward check (referencing
detect_arguments_format_from_text and preserve fallback to
ToolCallArgumentsFormat::Dict and the downstream process_tool_call_arguments
behavior).
ℹ️ Review info

Review profile: ASSERTIVE

📒 Files selected for processing (7)
- crates/tokenizer/src/cache/mod.rs
- crates/tokenizer/src/chat_template.rs
- crates/tokenizer/src/huggingface.rs
- crates/tokenizer/src/tiktoken.rs
- crates/tokenizer/src/traits.rs
- model_gateway/src/routers/grpc/utils/chat_utils.rs
- model_gateway/src/routers/grpc/utils/message_utils.rs
Replace the string-scan heuristic with AST-based detection integrated into the existing Detector. The AST walk looks for BinOp(Add|Concat) nodes where either operand accesses "arguments"; this is precise and immune to false positives from prose text, comments, or cross-expression boundaries. Falls back to a minimal string scan when AST parsing fails (e.g. incomplete template snippets).

Addresses reviewer feedback about word boundaries, expression-boundary crossing, and non-tool "arguments" text in templates.

Signed-off-by: Chang Su <chang.s.su@oracle.com>
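The AST-walk idea can be illustrated without minijinja's feature-gated machinery API. The sketch below uses a tiny invented expression type (not minijinja's `Expr`) to show the two-part recursion the commit describes: a concat node counts as string-format evidence only when one side reaches an `arguments` access, and serialization filters like `tojson`/`items` stop the recursion (the guard a later review comment suggests).

```rust
// Invented minimal AST for illustration; the real code walks minijinja's AST.
#[derive(Debug)]
enum Expr {
    Var(String),                  // e.g. `arguments`
    GetItem(Box<Expr>, String),   // e.g. tool['function']['arguments']
    Concat(Box<Expr>, Box<Expr>), // Jinja `+` or `~`
    Filter(String, Box<Expr>),    // e.g. `|tojson`
    Str(String),                  // string literal
}

/// Does any sub-expression reach an `arguments` access?
/// Dict-consuming filters (`tojson`, `items`) stop the recursion,
/// since values they wrap are already serialized for dict rendering.
fn accesses_arguments(e: &Expr) -> bool {
    match e {
        Expr::Var(name) => name.as_str() == "arguments",
        Expr::GetItem(inner, key) => key.as_str() == "arguments" || accesses_arguments(inner),
        Expr::Concat(l, r) => accesses_arguments(l) || accesses_arguments(r),
        Expr::Filter(name, _) if name.as_str() == "tojson" || name.as_str() == "items" => false,
        Expr::Filter(_, inner) => accesses_arguments(inner),
        Expr::Str(_) => false,
    }
}

/// A concat node is string-format evidence when either operand
/// reaches an unfiltered `arguments` access.
fn is_args_concat(e: &Expr) -> bool {
    match e {
        Expr::Concat(l, r) => accesses_arguments(l) || accesses_arguments(r),
        _ => false,
    }
}
```

With this shape, `tool['arguments'] + '\n'` is flagged as String while `tool['arguments']|tojson ~ '\n'` stays Dict, matching the behavior the reviewers converge on below.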
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b610b28a20
```rust
Stmt::EmitExpr(e) => {
    self.inspect_expr_for_structure(&e.expr);
    if !self.args_concat {
        self.check_expr_for_args_concat(&e.expr);
    }
```
Scan set expressions for arguments concatenation
The detector only calls check_expr_for_args_concat in Stmt::EmitExpr, so concatenation inside {% set ... %} is never seen. A template like {% set rendered = tool['function']['arguments'] + '\n' %}{{ rendered }} will be classified as Dict, and process_tool_call_arguments will convert arguments to an object, causing a runtime type error when the template executes the + on a dict.
```rust
    Expr::Const(_) => {}
    _ => {}
}
```
Recurse through wrapped expressions in concat scan
check_expr_for_args_concat stops on any non-BinOp node, so it misses concatenation when a binop is wrapped (for example {{ (tool['function']['arguments'] + '\n')|trim }} where the root is a filter). Those templates are misdetected as Dict, which again leads to dict/string concat failures during rendering.
````rust
#[test]
fn test_args_format_string_concat_fallback() {
    // Incomplete template snippet that fails AST parse → exercises fallback
    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}";
````
🟡 Nit: This template is syntactically valid Jinja2 (a top-level for loop is fine — message being undefined is a runtime error, not a parse error), so parse() will succeed and the AST path handles this, not the fallback.
This means detect_arguments_format_fallback has zero test coverage. If you want to actually exercise the fallback, you'd need a template with genuinely broken syntax, e.g.:
````diff
-    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}";
+    // Template with broken syntax that fails AST parse → exercises fallback
+    let template = "{%- for tool in message['tool_calls'] %}{{'```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```'}}{%- endfor %}{%";
````
(Note the trailing {% which makes the template unparseable.)
```rust
    {
        i += 1;
    }
    if i < bytes.len() && bytes[i] == b'+' {
```
🟡 Nit: The AST path checks for both + and ~ (Jinja's concat operator) via BinOpKind::Add | BinOpKind::Concat, but this fallback only checks for +. If a template uses ~ for string concatenation (e.g. tool.arguments ~ '\n') and also fails AST parsing, this would incorrectly return Dict.
Very unlikely in practice (DeepSeek uses +, and AST parse failures are rare), but worth noting the inconsistency. A quick fix if desired:
```rust
if i < bytes.len() && (bytes[i] == b'+' || bytes[i] == b'~') {
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokenizer/src/chat_template.rs`:
- Around line 284-288: The args-concat detector only runs in the Stmt::EmitExpr
branch and only descends BinOp, so wrapped/filter/assigned uses of
tool['function']['arguments'] are missed; update check_expr_for_args_concat to
recursively handle Expr::Filter, Expr::GetAttr, Expr::GetItem, Expr::Test, and
Expr::UnaryOp following the traversal pattern used by
inspect_expr_for_structure, and call check_expr_for_args_concat from non-emit
statements where expressions may assign or wrap values (e.g., in Set/Assign or
when inspecting a rendered variable) so patterns like
"(tool['function']['arguments'] ~ '\n')|trim" and "{% set rendered =
tool['function']['arguments'] ~ '\n' %}{{ rendered }}" are detected as String.
- Around line 206-229: tree_accesses_arguments currently descends into every
Expr::Filter and will treat values wrapped by serialization filters (e.g.,
tojson or items) as accessing "arguments", causing dict-style expressions to be
misclassified; modify tree_accesses_arguments so that when matching Expr::Filter
it first checks the filter name and returns false (skip recursion) for filters
"tojson" and "items" (or their equivalent identifiers) and only otherwise
recurses into f.expr, and add a regression test exercising the concatenated
tojson case like the example in the comment to ensure dict rendering is
preserved; update any related behavior in is_args_concat if needed to rely on
the corrected tree_accesses_arguments.
ℹ️ Review info

Review profile: ASSERTIVE

📒 Files selected for processing (1)
- crates/tokenizer/src/chat_template.rs
```rust
/// Recursively check if any sub-expression in a tree accesses "arguments"
fn tree_accesses_arguments(expr: &Expr) -> bool {
    if Self::accesses_arguments(expr) {
        return true;
    }
    match expr {
        Expr::BinOp(op) => {
            Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right)
        }
        Expr::Filter(f) => f
            .expr
            .as_ref()
            .is_some_and(|e| Self::tree_accesses_arguments(e)),
        Expr::GetAttr(g) => Self::tree_accesses_arguments(&g.expr),
        Expr::GetItem(g) => Self::tree_accesses_arguments(&g.expr),
        _ => false,
    }
}

/// Check if a BinOp uses `+` or `~` with an `arguments` access on either side
fn is_args_concat(op: &minijinja::machinery::ast::BinOp) -> bool {
    matches!(op.op, BinOpKind::Add | BinOpKind::Concat)
        && (Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right))
}
```
Add regression test for concatenated |tojson and exclude dict-oriented filters from raw-argument detection.
The tree_accesses_arguments() function walks into all filters, so a dict-style expression like {{ '{"arguments": ' ~ tool['function']['arguments']|tojson ~ '}' }} is incorrectly classified as String because the recursion finds arguments inside the tojson filter. This skips downstream Dict processing. Exclude tojson and items filters (which explicitly serialize for Dict rendering) by adding a guard clause, and add a regression test for the concatenated |tojson case.
Suggested fix:

```diff
 /// Recursively check if any sub-expression in a tree accesses "arguments"
 fn tree_accesses_arguments(expr: &Expr) -> bool {
     if Self::accesses_arguments(expr) {
         return true;
     }
     match expr {
+        Expr::Filter(f) if matches!(f.name, "tojson" | "items") => false,
         Expr::BinOp(op) => {
             Self::tree_accesses_arguments(&op.left) || Self::tree_accesses_arguments(&op.right)
         }
         Expr::Filter(f) => f
             .expr
             .as_ref()
             .is_some_and(|e| Self::tree_accesses_arguments(e)),
         Expr::GetAttr(g) => Self::tree_accesses_arguments(&g.expr),
         Expr::GetItem(g) => Self::tree_accesses_arguments(&g.expr),
         _ => false,
     }
 }
```
```rust
Stmt::EmitExpr(e) => {
    self.inspect_expr_for_structure(&e.expr);
    if !self.args_concat {
        self.check_expr_for_args_concat(&e.expr);
    }
```
Handle arguments concatenation in filtered and wrapped expressions.
The args-concat detector only runs from Stmt::EmitExpr and only descends through Expr::BinOp, leaving wrapper expressions unchecked. Templates like {{ (tool['function']['arguments'] ~ '\n')|trim }} and {% set rendered = tool['function']['arguments'] ~ '\n' %}{{ rendered }} will still be misclassified as Dict. Both patterns are valid Jinja2 and expect raw strings, causing the same type mismatch that blocks DeepSeek V3.1 templates.
Extend check_expr_for_args_concat() to recurse through Filter, GetAttr, GetItem, Test, and UnaryOp (following the pattern already used in inspect_expr_for_structure()), and also call it from non-emit statements when arguments may be assigned or wrapped.
🧪 Regression cases

```rust
assert_eq!(
    detect_tool_call_arguments_format(
        "{{ (tool['function']['arguments'] ~ '\\n')|trim }}"
    ),
    ToolCallArgumentsFormat::String,
);
assert_eq!(
    detect_tool_call_arguments_format(
        "{% set rendered = tool['function']['arguments'] ~ '\\n' %}{{ rendered }}"
    ),
    ToolCallArgumentsFormat::String,
);
```
Hi @CatherineSue, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description
Problem
`process_tool_call_arguments()` unconditionally converts tool_call arguments from JSON string to dict before passing them to the chat template. This follows the Transformers/HuggingFace guidance that templates should expect dict arguments. However, DeepSeek V3/R1 family templates use plain string concatenation for arguments (e.g. `tool['function']['arguments'] + '\n'`), which crashes when given a dict.

vLLM has the same issue and works around it by shipping override templates that add `|tojson`. This requires users to manually pass `--chat-template`.

Solution
Detect the expected arguments format from the chat template at tokenizer init time and conditionally apply `process_tool_call_arguments()`.

An exhaustive audit of 25+ model chat templates across /raid/models, HuggingFace, and e2e test models found:

Since dict is the overwhelming majority, the detection is flipped: detect the string-concat pattern (`arguments` followed by `+`) and default to dict. This is a ~20-line O(n) string scan; no regex or AST parsing needed.

Changes
- `crates/tokenizer/src/chat_template.rs`: Add `ToolCallArgumentsFormat` enum (Dict | String) and a `detect_tool_call_arguments_format()` detection function. Integrate into `ChatTemplateState` alongside existing content format detection.
- `crates/tokenizer/src/traits.rs`: Add `tool_call_arguments_format()` to the `Tokenizer` trait.
- `crates/tokenizer/src/huggingface.rs`, `tiktoken.rs`, `cache/mod.rs`: Implement the new trait method.
- `model_gateway/src/routers/grpc/utils/chat_utils.rs`: Make `process_tool_call_arguments()` conditional on `ToolCallArgumentsFormat::Dict`.
- `model_gateway/src/routers/grpc/utils/message_utils.rs`: Same conditional for the Messages API path.
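The gateway-side change reduces to a small conditional. A hypothetical sketch, using the trait method and enum names from this PR but with an invented stand-in trait and stub tokenizer (the real trait and gateway code are more involved):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolCallArgumentsFormat {
    Dict,
    String,
}

// Invented stand-in for the Tokenizer trait method this PR adds.
trait ArgumentsFormatSource {
    fn tool_call_arguments_format(&self) -> ToolCallArgumentsFormat;
}

// Stub tokenizer that reports a fixed, pre-detected format.
struct StubTokenizer(ToolCallArgumentsFormat);

impl ArgumentsFormatSource for StubTokenizer {
    fn tool_call_arguments_format(&self) -> ToolCallArgumentsFormat {
        self.0
    }
}

/// Decide per request whether to run the string→dict conversion,
/// mirroring the conditional described for chat_utils.rs / message_utils.rs.
fn should_process_tool_call_arguments(tok: &dyn ArgumentsFormatSource) -> bool {
    tok.tool_call_arguments_format() == ToolCallArgumentsFormat::Dict
}
```

Because detection runs once at tokenizer init, this check is a cheap enum comparison on the request path.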
Test Plan

- Verified detection against templates from /raid/models and HuggingFace (DeepSeek-V3-0324, V3.1, R1-0528, R1-Distill-Qwen-7B, R1-Distill-Qwen-32B all correctly detected as String).

Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit
New Features
Improvements
Tests