
Fix incorrect prompt tokens count due to HF api update #1264

Merged

Kipok merged 5 commits into main from igitman/fix-prompt-count on Feb 20, 2026
Conversation

@Kipok Kipok (Collaborator) commented Feb 20, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Restored compatibility with newer HuggingFace tokenizer outputs so prompt handling, inference prefix measurement, and token counting work correctly across varied tokenizer return formats.
  • Tests

    • Added tests using a real tokenizer to validate token counting for strings, message lists, optional tool data, and null inputs.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok requested review from ekmb and jiacheng-xu February 20, 2026 20:34
@coderabbitai coderabbitai bot (Contributor) commented Feb 20, 2026

📝 Walkthrough

Updates three modules and tests to handle HuggingFace tokenizer versions that may return a dict (BatchEncoding) from apply_chat_template; code now extracts input_ids from dicts while retaining support for list returns. Adds tests for token counting with a real tokenizer.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Tokenizer Compatibility: `nemo_skills/inference/model/utils.py`, `nemo_skills/inference/prover.py`, `nemo_skills/prompt/utils.py` | Add conditional extraction of `input_ids` when `apply_chat_template` returns a dict/`BatchEncoding` instead of a list; preserves backward compatibility with list returns for encoding and token counting. |
| Token Count Tests: `tests/test_prompts.py` | Add `test_get_token_count` using `AutoTokenizer` and `get_token_count`, covering strings, message lists, tools schema, and `None` inputs; adjust imports accordingly. |
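
The change itself is small. A minimal sketch of the normalization described above, mirroring the committed diff shown later in this thread; the helper name `count_prompt_tokens` is hypothetical:

```python
def count_prompt_tokens(tokenizer, messages, tools=None):
    result = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, tools=tools
    )
    # Newer HF tokenizers may return a dict-like BatchEncoding rather than a
    # plain list of token ids; extract input_ids so len() counts tokens either way.
    if not isinstance(result, list):
        result = result["input_ids"]
    return len(result)
```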

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


🚥 Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: fixing token count issues caused by HuggingFace tokenizer API updates across multiple files. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/prompt/utils.py (1)

404-405: Broad except Exception now silences KeyError from the newly-added result["input_ids"] access.

Because lines 400–402 are inside the try block, a KeyError (e.g., the tokenizer returns a dict without an "input_ids" key) is caught here and re-raised as ValueError("Invalid chat message format: 'input_ids'") — a misleading message that hides the real failure.

Additionally, the missing from e in the re-raise discards the original traceback.

Recommended fix: move the dict-normalization step outside the try so only apply_chat_template errors are wrapped, and add from e for proper exception chaining.

♻️ Proposed refactor

```diff
     elif isinstance(messages, list):
         messages = [
             message if isinstance(message, dict) else message_to_dict(copy.deepcopy(message)) for message in messages
         ]
         try:
             result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
-            # Handle newer HF tokenizer versions that return a dict instead of a list
-            if isinstance(result, dict):
-                result = result["input_ids"]
-            return len(result)
-
         except Exception as e:
-            raise ValueError(f"Invalid chat message format: {e}")
+            raise ValueError(f"Invalid chat message format: {e}") from e
+        # Handle newer HF tokenizer versions that return a dict instead of a list
+        if isinstance(result, dict):
+            result = result["input_ids"]
+        return len(result)
```

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving" and "Follow the Zen of Python principles: ensure errors never pass silently unless explicitly silenced."
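
For illustration, a minimal standalone example of the chaining behavior the guideline and the `from e` suggestion refer to; the function and config key here are hypothetical:

```python
def read_required_key(config: dict) -> str:
    try:
        return config["key"]
    except KeyError as e:
        # "from e" stores the original KeyError in __cause__, so the traceback
        # shows both exceptions instead of silently discarding the root cause.
        raise ValueError(f"Invalid config: {e}") from e
```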

🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/prompt/utils.py` around lines 404-405: the current broad except
around apply_chat_template is catching KeyError from the new dict-normalization
(accessing result["input_ids"]) and re-raising a misleading ValueError without
chaining. Fix by moving the dict normalization/validation (the access of
result["input_ids"] and any token dict checks) outside the try/except so only
apply_chat_template() is wrapped, change the except to catch only expected
exceptions from apply_chat_template (e.g., ValueError or TemplateError if one
exists), and when re-raising include exception chaining (use "from e") so the
original traceback is preserved.
```

Comment on lines +21 to +44

```python
def test_get_token_count():
    tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
    messages = [{"role": "user", "content": "hello"}]

    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }
    ]

    assert get_token_count(tokenizer, "hello") == 1
    assert get_token_count(tokenizer, messages) == 17
    assert get_token_count(tokenizer, messages, tools=tools) == 266
    assert get_token_count(None, "hello") is None
    assert get_token_count(tokenizer, None) is None
```

⚠️ Potential issue | 🟡 Minor

Hard-coded token counts and a network-bound tokenizer make this test fragile.

Two concerns:

  1. Hard-coded expected values (== 1, == 17, == 266) are tied to the exact current state of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16's chat template and vocabulary. Any tokenizer patch from the model owner will silently break these assertions, requiring manual inspection to distinguish a real regression from a tokenizer update.

  2. Network dependency. AutoTokenizer.from_pretrained(...) downloads the tokenizer files from HuggingFace Hub at test time. This makes the test slow and may fail in air-gapped or rate-limited CI environments. The existing prompt tests (e.g., test_generic_math_prompt) also download tokenizers, but they at least assert deterministic string rendering rather than numeric token counts that can drift.

The core intent — verifying that the dict-return path produces the correct token count rather than len(dict) — is sound. Consider asserting count > 10 (sanity-bound) instead of an exact value, or mock apply_chat_template to explicitly return a dict and assert the extracted length is correct.
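
A minimal sketch of the mocked variant suggested above, assuming get_token_count is importable from nemo_skills.prompt.utils as this review implies; the dummy token ids are illustrative:

```python
from unittest.mock import MagicMock

from nemo_skills.prompt.utils import get_token_count


def test_get_token_count_dict_return():
    # Dummy tokenizer whose apply_chat_template returns a BatchEncoding-style
    # dict, mimicking newer HF versions; no network access is needed.
    tokenizer = MagicMock()
    tokenizer.apply_chat_template.return_value = {"input_ids": [1, 2, 3, 4, 5]}

    messages = [{"role": "user", "content": "hello"}]
    # The count must come from len(result["input_ids"]), not len(dict).
    assert get_token_count(tokenizer, messages) == 5
```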

🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In `@tests/test_prompts.py` around lines 21-44: the test uses network-bound
AutoTokenizer.from_pretrained and brittle hard-coded token counts. Remove the
exact numeric assertions and either (a) mock AutoTokenizer.from_pretrained (or
inject a dummy tokenizer) so the test does not hit the network and assert a
sanity bound (e.g., token count > 10) for get_token_count(tokenizer, messages)
and get_token_count(tokenizer, messages, tools), or (b) mock apply_chat_template
to return a deterministic dict and assert get_token_count correctly computes
length from that dict (and assert get_token_count(None, ...) remains None).
Focus fixes around get_token_count, AutoTokenizer.from_pretrained, and
apply_chat_template to eliminate network calls and fragile exact-count checks.
```

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


Comment on lines +398 to 405

```python
            result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
            # Handle newer HF tokenizer versions that return a BatchEncoding instead of a list
            if not isinstance(result, list):
                result = result["input_ids"]
            return len(result)

        except Exception as e:
            raise ValueError(f"Invalid chat message format: {e}")
```

⚠️ Potential issue | 🟡 Minor

except Exception masks the new KeyError path; add raise … from e.

After the new lines 400-401, if result is neither a list nor a dict-like object that contains input_ids, the resulting KeyError falls into the catch-all handler at line 404 and surfaces as "Invalid chat message format: 'input_ids'" — a misleading message that hides the real cause. More broadly, the pre-existing except Exception block also violates the project guideline to let code fail with clear errors instead of silently misbehaving.

At minimum, preserve the exception chain with from e (static analysis B904) so the original traceback is not swallowed:

🔗 Proposed fix: preserve exception chain

```diff
         except Exception as e:
-            raise ValueError(f"Invalid chat message format: {e}")
+            raise ValueError(f"Invalid chat message format: {e}") from e
```

Ideally, drop the wrapper entirely and let apply_chat_template (and result["input_ids"]) fail with their own clear errors, in line with the guideline to avoid silently misbehaving.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
             result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
             # Handle newer HF tokenizer versions that return a BatchEncoding instead of a list
             if not isinstance(result, list):
                 result = result["input_ids"]
             return len(result)
         except Exception as e:
-            raise ValueError(f"Invalid chat message format: {e}")
+            raise ValueError(f"Invalid chat message format: {e}") from e
```
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 404-404: Do not catch blind exception: Exception (BLE001)

[warning] 405-405: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)

[warning] 405-405: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents

```text
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/prompt/utils.py` around lines 398-405: the current try/except
around tokenizer.apply_chat_template(...) swallows and masks underlying errors
(e.g. KeyError from accessing result["input_ids"]). Either remove the outer
try/except so underlying exceptions surface with their original tracebacks, or,
if you must keep the wrapper, re-raise while preserving the exception chain by
using raise ValueError(f"Invalid chat message format: {e}") from e; update the
handler that references result["input_ids"] and the call to
tokenizer.apply_chat_template to ensure errors are not hidden.
```

@Kipok Kipok merged commit 1f1a2e7 into main on Feb 20, 2026
5 checks passed
@Kipok Kipok deleted the igitman/fix-prompt-count branch February 20, 2026 23:58
