fix(llm): consistent tokenizer types and fix from_tokenizer fallback bug #287
binaryaaron wants to merge 3 commits into main
Conversation
SmolLM2 metadata hardcoded `bos_token_id=151644` (copied from Llama32), but SmolLM2 models have a 49152-token vocab where `<|im_start|>` is at ID 1. The assembler (`Example.add_sequence`) prepends `bos_token_id` to every training sequence, so 151644 was passed to `Embedding(49152, 576)`, triggering a CUDA `vectorized_gather_kernel` index-out-of-bounds assert during training. Look up the actual token ID from the tokenizer at init time instead of hardcoding it. Also fix stale AGENTS.md that said `ignore_eos=True` (now always `False`).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
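The fix described in this commit can be sketched as follows. The tokenizer class and `resolve_bos_token_id` helper below are illustrative stand-ins, not the actual project code; only `convert_tokens_to_ids` mirrors the real Hugging Face tokenizer method:

```python
class FakeSmolLM2Tokenizer:
    """Stand-in tokenizer with SmolLM2's actual mapping: <|im_start|> -> 1."""

    _vocab = {"<|im_start|>": 1, "<|im_end|>": 2}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


def resolve_bos_token_id(tokenizer, bos_token: str = "<|im_start|>") -> int:
    # Derive the ID from the tokenizer at init time rather than copying
    # another model's hardcoded value (151644 belongs to Llama32's vocab,
    # which is out of range for SmolLM2's Embedding(49152, 576)).
    return tokenizer.convert_tokens_to_ids(bos_token)


bos_id = resolve_bos_token_id(FakeSmolLM2Tokenizer())
assert bos_id == 1  # in-vocab for a 49152-token embedding; 151644 was not
```

The same lookup works for any chat-template model, which is why the follow-up commit applies it to Llama32 and SmolLM3 as well.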
`from_tokenizer` used the raw `tokenizer` parameter (possibly `None`) instead of the resolved `_tokenizer` local for `getattr` fallbacks, silently returning `None` for BOS/EOS fields when callers omitted `tokenizer=`. Also: all `ModelMetadata` subclasses now use `PreTrainedTokenizer | None` parameter types, pass `tokenizer=` to `from_tokenizer`, and derive `bos_token_id` dynamically for Llama32/SmolLM3 (same bug class as the SmolLM2 fix in the parent PR).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
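The fallback bug is easy to reproduce in miniature. The function names and signatures below are hypothetical simplifications of `LLMPromptConfig.from_tokenizer`, kept only to show the one-character class of bug: reading the raw parameter instead of the resolved local:

```python
class Tok:
    """Minimal stand-in for a tokenizer exposing a BOS token attribute."""
    bos_token = "<|im_start|>"


def from_tokenizer_buggy(tokenizer=None, default_tokenizer=None):
    _tokenizer = tokenizer if tokenizer is not None else default_tokenizer
    # BUG: getattr reads the raw `tokenizer` parameter, which is None
    # whenever the caller relied on the default; `_tokenizer` goes unused.
    return getattr(tokenizer, "bos_token", None)


def from_tokenizer_fixed(tokenizer=None, default_tokenizer=None):
    _tokenizer = tokenizer if tokenizer is not None else default_tokenizer
    # FIX: fall back through the resolved local instead.
    return getattr(_tokenizer, "bos_token", None)


assert from_tokenizer_buggy(default_tokenizer=Tok()) is None  # silent None
assert from_tokenizer_fixed(default_tokenizer=Tok()) == "<|im_start|>"
```

Because `getattr(..., default)` swallows the `None` receiver without raising, the bug only shows up as missing BOS/EOS configuration downstream, which is why it survived until the type annotations were tightened.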
tests/llm/test_metadata.py
Outdated
```diff
  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1,
```
suggestion: This should have a comment/explanation that the token id here is what's injected by mock_tokenizer and does not reflect the actual token ids of the named model that would be used running NSS.
Co-authored-by: Matt Kornfield <mckornfield@gmail.com> Signed-off-by: Matt Kornfield <mckornfield@gmail.com>
Pull request overview
This PR is a follow-up to the metadata standardization work, fixing an `LLMPromptConfig.from_tokenizer` fallback bug and aligning tokenizer types/usage across `ModelMetadata` subclasses to avoid silently missing BOS/EOS configuration and hardcoded special-token IDs.
Changes:
- Fix `LLMPromptConfig.from_tokenizer` to use the resolved tokenizer instance for `getattr(...)` fallbacks.
- Standardize `tokenizer` parameter types to `PreTrainedTokenizer | None`, consistently pass `tokenizer=` into `from_tokenizer`, and derive `<|im_start|>` token IDs dynamically for Llama32/SmolLM2/SmolLM3.
- Update unit tests' mock tokenizer to support `convert_tokens_to_ids`, and adjust documentation about generation special-token output behavior.
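A plausible shape for the extended test mock is sketched below; the class is hypothetical (the real fixture lives in `tests/llm/test_metadata.py`), but it shows why the tests now expect `bos_token_id=1`: the mock injects its own tiny vocab rather than any real model's IDs:

```python
class MockTokenizer:
    """Hypothetical test double: the only contract the metadata code
    needs is bos/eos token attributes plus convert_tokens_to_ids."""

    bos_token = "<|im_start|>"
    eos_token = "<|im_end|>"
    _vocab = {"<|im_start|>": 1, "<|im_end|>": 2}

    def convert_tokens_to_ids(self, token: str) -> int:
        # Dynamic lookup, mirroring the HF tokenizer API the fix relies on.
        return self._vocab[token]


mock = MockTokenizer()
assert mock.convert_tokens_to_ids(mock.bos_token) == 1  # mock-injected ID
```

Any metadata class that derives its BOS ID via `convert_tokens_to_ids` can then be exercised against this mock without downloading a real tokenizer.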
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/llm/test_metadata.py` | Updates expected BOS token IDs and extends the tokenizer mock to support dynamic ID lookup. |
| `src/nemo_safe_synthesizer/llm/metadata.py` | Fixes the tokenizer fallback bug, unifies tokenizer typing, and removes hardcoded `<\|im_start\|>` token IDs. |
| `src/nemo_safe_synthesizer/generation/AGENTS.md` | Updates the generation "Gotchas" docs around special-token outputs and EOS handling. |
```diff
  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use
```
Typo in the inline comment: "mock_tockenizer" should be "mock_tokenizer" to match the fixture name and avoid confusion when reading the test expectations.
```diff
- expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use
+ expected_bos_token_id=1, # this is the token_id that the mock_tokenizer injects, not what a real model would use
```
```diff
  - `_resolve_temperature()` — forces `temperature=0.0` when `do_sample=False`; raises if `do_sample=False` and `temperature > 0`.
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`, `ignore_eos=True`. Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
```
This description of need_special_token_outputs is inaccurate: the stated condition "not TabularDataProcessor ... or not TimeSeriesDataProcessor" would be true for essentially any concrete processor, and it also doesn't match the code. In generation/vllm_backend.py, the condition is need_special_token_outputs = not isinstance(self.processor, TabularDataProcessor), while timeseries_backend.py sets skip_special_tokens=True/include_stop_str_in_output=False unconditionally. Please update this doc line to reflect the actual logic per backend.
```diff
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — in `VllmBackend`, this is `True` when the processor is not `TabularDataProcessor`, and then we use `skip_special_tokens=False` and `include_stop_str_in_output=True` (needed e.g. for `GroupedDataProcessor` BOS/EOS). In `TimeseriesBackend`, special tokens are always skipped and stop strings are always excluded (`skip_special_tokens=True`, `include_stop_str_in_output=False`), independent of this flag. `ignore_eos` is always `False` (native EOS stopping).
```
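The per-backend logic described in the review comment can be sketched as below. The processor class names come from the PR discussion; the flag-building functions and return shapes are illustrative, not the actual `vllm_backend.py` / `timeseries_backend.py` code:

```python
class TabularDataProcessor:
    """Stand-in for the tabular processor."""

class GroupedDataProcessor:
    """Stand-in for a processor that needs BOS/EOS in the raw output."""


def vllm_sampling_flags(processor) -> dict:
    # VllmBackend: special-token outputs are needed for every processor
    # except TabularDataProcessor (this is the actual condition per the
    # review comment; the dict shape is illustrative).
    need_special_token_outputs = not isinstance(processor, TabularDataProcessor)
    return {
        "skip_special_tokens": not need_special_token_outputs,
        "include_stop_str_in_output": need_special_token_outputs,
        "ignore_eos": False,  # always False: rely on native EOS stopping
    }


def timeseries_sampling_flags(processor) -> dict:
    # TimeseriesBackend: unconditional, regardless of processor type.
    return {
        "skip_special_tokens": True,
        "include_stop_str_in_output": False,
        "ignore_eos": False,
    }


assert vllm_sampling_flags(GroupedDataProcessor())["skip_special_tokens"] is False
assert vllm_sampling_flags(TabularDataProcessor())["skip_special_tokens"] is True
assert timeseries_sampling_flags(GroupedDataProcessor())["skip_special_tokens"] is True
```

Keeping `ignore_eos=False` everywhere is what lets `include_stop_str_in_output=True` coexist with normal EOS-terminated generation.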
Summary
Follow-up to #286. Fixes two additional issues found while standardizing the metadata module:
- `LLMPromptConfig.from_tokenizer` used the raw `tokenizer` parameter (possibly `None`) instead of the resolved `_tokenizer` local for `getattr` fallbacks -- silently returning `None` for BOS/EOS fields when callers omitted `tokenizer=`
- All `ModelMetadata` subclasses now use `PreTrainedTokenizer | None` parameter types (was a mix of untyped, `AutoTokenizer | None`, and `PreTrainedTokenizer | None`), pass `tokenizer=` to `from_tokenizer`, and derive `bos_token_id` dynamically for Llama32 and SmolLM3 (same bug class as the SmolLM2 fix)
- Removed `tokenizer: AutoTokenizer = ...` body re-annotations in Mistral and Nemotron that shadowed the parameter with a wrong type

Test plan
- `uv run pytest tests/llm/test_metadata.py` -- 52 passed
- `tynav diagnostics src/nemo_safe_synthesizer/llm/metadata.py` -- 0 errors (was 1 `invalid-argument-type` before)
- `make format` -- clean
- `make check` -- clean (only pre-existing `dumb_tests.ipynb` diagnostics)

Made with Cursor