
fix: Add get_hf_tokenizer_kwargs to NemotronHBridge #2664

Open

shanecmoran wants to merge 1 commit into NVIDIA-NeMo:main from shanecmoran:fix/nemotron-tokenizer-kwargs

Conversation

@shanecmoran commented Mar 5, 2026

Summary

Nemotron Nano v2 models ship only a fast tokenizer (tokenizer.json, no tokenizer.model). The container's megatron-core defaults to use_fast=False, which causes AutoTokenizer.from_pretrained to silently return False instead of a tokenizer object, crashing checkpoint conversion.

Changes

  • Added get_hf_tokenizer_kwargs() classmethod to NemotronHBridge returning {"use_fast": True}
  • Added unit tests for the new method
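
A minimal sketch of the change, with the bridge reduced to a stub (the real NemotronHBridge in src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py carries much more state; only the new classmethod is shown):

```python
class NemotronHBridge:
    """Stub standing in for the real bridge class."""

    @classmethod
    def get_hf_tokenizer_kwargs(cls) -> dict:
        # Nemotron Nano v2 checkpoints ship only tokenizer.json (a fast
        # tokenizer), so force use_fast=True to override the
        # megatron-core default of use_fast=False.
        return {"use_fast": True}
```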

Follows the existing GLM45VBridge pattern. The call site in auto_bridge.py:768 already uses hasattr to check for this method, so no other changes are needed.
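The hasattr dispatch at the call site can be sketched like this (the function name and stub classes are illustrative, not the actual auto_bridge.py code):

```python
def extra_tokenizer_kwargs(bridge) -> dict:
    # Bridges that define the hook contribute extra kwargs to
    # AutoTokenizer.from_pretrained; all others fall back to defaults.
    if hasattr(bridge, "get_hf_tokenizer_kwargs"):
        return bridge.get_hf_tokenizer_kwargs()
    return {}


class WithHook:
    @classmethod
    def get_hf_tokenizer_kwargs(cls):
        return {"use_fast": True}


class WithoutHook:
    pass


print(extra_tokenizer_kwargs(WithHook))     # {'use_fast': True}
print(extra_tokenizer_kwargs(WithoutHook))  # {}
```

Because the check is hasattr-based, bridges that never define the hook are untouched, which is why the fix is confined to NemotronHBridge.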

Fixes #2663

Summary by CodeRabbit

  • New Features

    • Added standardized tokenizer configuration for Nemotron-H models, enforcing the use of fast tokenizers for optimal performance and consistency.
  • Tests

    • Introduced test validation to verify tokenizer configuration settings and ensure fast tokenizer is properly enabled.

@copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Adds the get_hf_tokenizer_kwargs classmethod to NemotronHBridge that returns {"use_fast": True} to enforce fast tokenizer usage for Nemotron-H models. Includes unit tests validating the method's return value and configuration.

Changes

  • NemotronHBridge Implementation (src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py): Adds get_hf_tokenizer_kwargs() classmethod that returns a dict with use_fast: True to override the default tokenizer behavior.
  • Unit Tests (tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py): Adds TestNemotronHBridgeTokenizerKwargs test class with two test methods validating the return type and the use_fast configuration. Note: the tests appear duplicated within the file.
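
The two test methods could look roughly like this (a hedged sketch; the stub bridge stands in for the real import from megatron.bridge, and the method names are illustrative):

```python
class NemotronHBridge:
    """Stub for the real bridge; only the method under test is shown."""

    @classmethod
    def get_hf_tokenizer_kwargs(cls) -> dict:
        return {"use_fast": True}


class TestNemotronHBridgeTokenizerKwargs:
    """Test get_hf_tokenizer_kwargs method."""

    def test_returns_dict(self):
        # The hook must return a plain dict suitable for **kwargs
        # expansion into AutoTokenizer.from_pretrained.
        assert isinstance(NemotronHBridge.get_hf_tokenizer_kwargs(), dict)

    def test_use_fast_is_true(self):
        # The fast tokenizer must be explicitly enabled.
        assert NemotronHBridge.get_hf_tokenizer_kwargs()["use_fast"] is True
```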

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check (⚠️ Warning): The PR contains test duplication in test_nemotron_h_bridge.py (identical test suites inserted twice), which appears to be an unintended out-of-scope issue. Resolution: remove the duplicate test suite from test_nemotron_h_bridge.py to keep the PR focused on the required implementation.

✅ Passed checks (5 passed)

  • Description Check: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: the title clearly and concisely describes the main change: adding the get_hf_tokenizer_kwargs method to NemotronHBridge.
  • Linked Issues check: the PR implements the required get_hf_tokenizer_kwargs method on NemotronHBridge returning {"use_fast": True} and adds unit tests, fully addressing issue #2663.
  • Docstring Coverage: docstring coverage is 83.33%, above the required 80.00% threshold.
  • Test Results For Major Changes: the PR adds a 9-line classmethod following the existing GLM45VBridge pattern with comprehensive unit tests, meeting the custom check requirement for minor changes with documented test results.


@coderabbitai bot left a comment
🧹 Nitpick comments (1)
tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py (1)

358-370: Add a test category marker on the new tokenizer kwargs suite.

Please add a pytest.mark classification (for example, unit) to this new test class to match the test categorization convention.

Suggested change
+@pytest.mark.unit
 class TestNemotronHBridgeTokenizerKwargs:
     """Test get_hf_tokenizer_kwargs method."""

As per coding guidelines, "Use 'pytest.mark' to categorize tests (unit, integration, system)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py` around lines 358
- 370, The test class TestNemotronHBridgeTokenizerKwargs lacks a pytest.mark
category; add a marker (e.g., `@pytest.mark.unit`) above the class definition to
classify these tests as unit tests and keep suite conventions consistent—apply
the decorator to the TestNemotronHBridgeTokenizerKwargs class that contains
calls to NemotronHBridge.get_hf_tokenizer_kwargs().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9eafc85f-c796-4c56-b48a-7345116880a4

📥 Commits

Reviewing files that changed from the base of the PR and between 01cb6be and 7d29884.

📒 Files selected for processing (2)
  • src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py

@yaoyu-33 commented Mar 5, 2026

/ok to test 7d29884

Nemotron Nano v2 models only provide a fast tokenizer
(tokenizer.json, no tokenizer.model). The container's megatron-core
defaults use_fast=False, causing AutoTokenizer.from_pretrained to
silently return False instead of a tokenizer object, which then
crashes on self.tokenizer.chat_template during checkpoint conversion.

Add get_hf_tokenizer_kwargs returning {"use_fast": True}, following
the existing GLM45VBridge pattern. The call site in auto_bridge.py
already uses hasattr to check for this method.

Fixes: NVIDIA-NeMo#2663

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@shanecmoran force-pushed the fix/nemotron-tokenizer-kwargs branch from 7d29884 to a5588a7 on March 6, 2026 at 13:25


Development

Successfully merging this pull request may close these issues.

NemotronHBridge missing get_hf_tokenizer_kwargs causes checkpoint conversion failure
