Fix incorrect prompt tokens count due to HF api update #1264
Conversation
Signed-off-by: Igor Gitman <igitman@nvidia.com>
📝 Walkthrough
Updates three modules and tests to handle HuggingFace tokenizer versions that may return a dict (BatchEncoding) from `apply_chat_template` instead of a token list.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
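To illustrate why the dict return breaks the count (a generic illustration of the BatchEncoding shape, not code from this PR): `len()` on the mapping counts keys, not tokens.

```python
old_style = [101, 7592, 102]  # plain list of token ids (older tokenizers)
new_style = {                 # BatchEncoding-like mapping (newer tokenizers)
    "input_ids": [101, 7592, 102],
    "attention_mask": [1, 1, 1],
}

print(len(old_style))                # 3 -> correct token count
print(len(new_style))                # 2 -> number of keys, i.e. the bug
print(len(new_style["input_ids"]))   # 3 -> the intended count after the fix
```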
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/prompt/utils.py (1)
404-405: Broad `except Exception` now silences `KeyError` from the newly added `result["input_ids"]` access. Because lines 400-402 are inside the `try` block, a `KeyError` (e.g., the tokenizer returns a dict without an `"input_ids"` key) is caught here and re-raised as `ValueError("Invalid chat message format: 'input_ids'")`, a misleading message that hides the real failure. Additionally, the missing `from e` in the re-raise discards the original traceback.

Recommended fix: move the dict-normalization step outside the `try` so only `apply_chat_template` errors are wrapped, and add `from e` for proper exception chaining.

♻️ Proposed refactor

```diff
     elif isinstance(messages, list):
         messages = [
             message if isinstance(message, dict) else message_to_dict(copy.deepcopy(message))
             for message in messages
         ]
     try:
         result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
-        # Handle newer HF tokenizer versions that return a dict instead of a list
-        if isinstance(result, dict):
-            result = result["input_ids"]
-        return len(result)
     except Exception as e:
-        raise ValueError(f"Invalid chat message format: {e}")
+        raise ValueError(f"Invalid chat message format: {e}") from e
+    # Handle newer HF tokenizer versions that return a dict instead of a list
+    if isinstance(result, dict):
+        result = result["input_ids"]
+    return len(result)
```

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving" and "Follow the Zen of Python principles: ensure errors never pass silently unless explicitly silenced."
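For illustration, a small self-contained snippet (with a hypothetical `normalize` helper, not project code) showing what `from e` buys: the original `KeyError` stays attached as `__cause__`, so the real failure remains visible in the traceback.

```python
def normalize(result):
    # Hypothetical stand-in for the dict-normalization step.
    return result["input_ids"]  # raises KeyError if the key is missing

try:
    try:
        normalize({})  # no "input_ids" key -> KeyError
    except Exception as e:
        raise ValueError(f"Invalid chat message format: {e}") from e
except ValueError as err:
    # With "from e" the KeyError is chained; without it, __cause__ would be None.
    assert isinstance(err.__cause__, KeyError)
    print("chained cause:", repr(err.__cause__))
```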
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/prompt/utils.py` around lines 404 - 405, The current broad except in the function around apply_chat_template is catching KeyError from the new dict-normalization (accessing result["input_ids"]) and re-raising a misleading ValueError without chaining; fix by moving the dict normalization/validation (the access of result["input_ids"] and any token dict checks) outside the try/except so only apply_chat_template() is wrapped, change the except to catch only expected exceptions from apply_chat_template (e.g., ValueError or TemplateError if one exists), and when re-raising include exception chaining (use "from e") so the original traceback is preserved.
In `tests/test_prompts.py`:

```python
def test_get_token_count():
    tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
    messages = [{"role": "user", "content": "hello"}]

    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }
    ]

    assert get_token_count(tokenizer, "hello") == 1
    assert get_token_count(tokenizer, messages) == 17
    assert get_token_count(tokenizer, messages, tools=tools) == 266
    assert get_token_count(None, "hello") is None
    assert get_token_count(tokenizer, None) is None
```
Hard-coded token counts and a network-bound tokenizer make this test fragile.
Two concerns:

- Hard-coded expected values (`== 1`, `== 17`, `== 266`) are tied to the exact current state of `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`'s chat template and vocabulary. Any tokenizer patch from the model owner will silently break these assertions, requiring manual inspection to distinguish a real regression from a tokenizer update.
- Network dependency. `AutoTokenizer.from_pretrained(...)` downloads the tokenizer files from HuggingFace Hub at test time. This makes the test slow and may fail in air-gapped or rate-limited CI environments. The existing prompt tests (e.g., `test_generic_math_prompt`) also download tokenizers, but at least they assert deterministic string rendering rather than numeric token counts that can drift.
The core intent, verifying that the dict-return path produces the correct token count rather than `len(dict)`, is sound. Consider asserting a sanity bound such as `count > 10` instead of an exact value, or mock `apply_chat_template` to explicitly return a dict and assert the extracted length is correct, as sketched below.
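A network-free variant along the lines of option (b) might look like the following sketch. The `DummyTokenizer` and the import location are assumptions: the dummy only mimics `apply_chat_template`, assuming that is the sole tokenizer method `get_token_count` needs for message lists.

```python
from nemo_skills.prompt.utils import get_token_count  # assumed import location


class DummyTokenizer:
    """Hypothetical stand-in that mimics newer HF tokenizers returning a dict."""

    def apply_chat_template(self, messages, tokenize=True, add_generation_prompt=True, tools=None):
        n = 5 if tools is None else 9  # arbitrary deterministic lengths
        return {"input_ids": list(range(n)), "attention_mask": [1] * n}


def test_get_token_count_handles_dict_return():
    tokenizer = DummyTokenizer()
    messages = [{"role": "user", "content": "hello"}]
    # The length must come from result["input_ids"], not len(dict).
    assert get_token_count(tokenizer, messages) == 5
    assert get_token_count(tokenizer, messages, tools=[{"type": "function"}]) == 9
    assert get_token_count(None, "hello") is None
```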
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_prompts.py` around lines 21 - 44, The test uses network-bound
AutoTokenizer.from_pretrained and brittle hard-coded token counts; instead,
remove exact numeric assertions and either (a) mock
AutoTokenizer.from_pretrained (or inject a dummy tokenizer) so the test does not
hit the network and assert a sanity bound (e.g., token count > 10) for
get_token_count(tokenizer, messages) and get_token_count(tokenizer, messages,
tools), or (b) mock apply_chat_template to return a deterministic dict and
assert get_token_count correctly computes length from that dict (and assert
get_token_count(None, ...) remains None); focus fixes around get_token_count,
AutoTokenizer.from_pretrained, and apply_chat_template to eliminate network
calls and fragile exact-count checks.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 1
In `nemo_skills/prompt/utils.py` (around lines 398-405):

```python
        result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
        # Handle newer HF tokenizer versions that return a BatchEncoding instead of a list
        if not isinstance(result, list):
            result = result["input_ids"]
        return len(result)

    except Exception as e:
        raise ValueError(f"Invalid chat message format: {e}")
```
`except Exception` masks the new `KeyError` path; add `raise … from e`.
After the new lines 400-401, if `result` is neither a list nor a dict-like object that contains `input_ids`, the resulting `KeyError` falls into the catch-all handler at line 404 and surfaces as `"Invalid chat message format: 'input_ids'"`, a misleading message that hides the real cause. More broadly, the pre-existing `except Exception` block also violates the project guideline to let code fail with clear errors instead of silently misbehaving.
At minimum, preserve the exception chain with `from e` (static analysis B904) so the original traceback is not swallowed:
🔗 Proposed fix: preserve exception chain

```diff
     except Exception as e:
-        raise ValueError(f"Invalid chat message format: {e}")
+        raise ValueError(f"Invalid chat message format: {e}") from e
```

Ideally, drop the wrapper entirely and let `apply_chat_template` (and `result["input_ids"]`) fail with their own clear errors, in line with the guideline to avoid silently misbehaving.
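If the wrapper were dropped as suggested, the logic might reduce to something like this sketch (hypothetical function name, indentation and surroundings assumed from the excerpt above, not the repository's actual code):

```python
def _chat_token_count(tokenizer, messages, tools=None):
    # No try/except: apply_chat_template and the "input_ids" lookup fail with
    # their own clear tracebacks instead of a generic ValueError.
    result = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, tools=tools
    )
    # Newer HF tokenizer versions may return a BatchEncoding instead of a list.
    if not isinstance(result, list):
        result = result["input_ids"]
    return len(result)
```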
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
         result = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, tools=tools)
         # Handle newer HF tokenizer versions that return a BatchEncoding instead of a list
         if not isinstance(result, list):
             result = result["input_ids"]
         return len(result)
     except Exception as e:
-        raise ValueError(f"Invalid chat message format: {e}")
+        raise ValueError(f"Invalid chat message format: {e}") from e
```
🧰 Tools
🪛 Ruff (0.15.1)
- [warning] 404-404: Do not catch blind exception: `Exception` (BLE001)
- [warning] 405-405: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
- [warning] 405-405: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/prompt/utils.py` around lines 398 - 405, The current try/except
around tokenizer.apply_chat_template(...) swallows and masks underlying errors
(e.g. KeyError from accessing result["input_ids"]); either remove the outer
try/except so underlying exceptions surface with their original tracebacks, or
if you must keep the wrapper, re-raise while preserving the exception chain by
using raise ValueError(f"Invalid chat message format: {e}") from e; update the
handler that references result["input_ids"] and the call to
tokenizer.apply_chat_template to ensure errors are not hidden.
Summary by CodeRabbit

Bug Fixes
- Prompt token counts are now computed correctly with newer HuggingFace tokenizer versions that return a BatchEncoding (dict) from `apply_chat_template` instead of a token list.

Tests
- Added a `get_token_count` test covering plain strings, chat messages, tool schemas, and `None` inputs.