
fix(llm): consistent tokenizer types and fix from_tokenizer fallback bug#287

Open
binaryaaron wants to merge 3 commits into `main` from `binaryaaron/fix-metadata-tokenizer-types`

Conversation

@binaryaaron
Collaborator

Summary

Follow-up to #286. Fixes two additional issues found while standardizing the metadata module:

  • `LLMPromptConfig.from_tokenizer` used the raw `tokenizer` parameter (possibly `None`) instead of the resolved `_tokenizer` local for its `getattr` fallbacks, silently returning `None` for BOS/EOS fields when callers omitted `tokenizer=`
  • All `ModelMetadata` subclasses now use `PreTrainedTokenizer | None` parameter types (previously a mix of untyped, `AutoTokenizer | None`, and `PreTrainedTokenizer | None`), pass `tokenizer=` to `from_tokenizer`, and derive `bos_token_id` dynamically for Llama32 and SmolLM3 (same bug class as the SmolLM2 fix)
  • Removed incorrect `tokenizer: AutoTokenizer = ...` body re-annotations in Mistral and Nemotron that shadowed the parameter with the wrong type

Test plan

  • uv run pytest tests/llm/test_metadata.py -- 52 passed
  • tynav diagnostics src/nemo_safe_synthesizer/llm/metadata.py -- 0 errors (was 1 invalid-argument-type before)
  • make format -- clean
  • make check -- clean (only pre-existing dumb_tests.ipynb diagnostics)

Made with Cursor

SmolLM2 metadata hardcoded bos_token_id=151644 (copied from Llama32),
but SmolLM2 models have a 49152-token vocab where <|im_start|> is at
ID 1.  The assembler (Example.add_sequence) prepends bos_token_id to
every training sequence, so 151644 was passed to Embedding(49152, 576),
triggering a CUDA vectorized_gather_kernel index-out-of-bounds assert
during training.

Look up the actual token ID from the tokenizer at init time instead of
hardcoding it.

Also fix stale AGENTS.md that said ignore_eos=True (now always False).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
from_tokenizer used the raw `tokenizer` parameter (possibly None) instead
of the resolved `_tokenizer` local for getattr fallbacks, silently
returning None for BOS/EOS fields when callers omitted tokenizer=.

Also: all ModelMetadata subclasses now use PreTrainedTokenizer | None
parameter types, pass tokenizer= to from_tokenizer, and derive
bos_token_id dynamically for Llama32/SmolLM3 (same bug class as the
SmolLM2 fix in the parent PR).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
@binaryaaron binaryaaron requested a review from a team as a code owner March 24, 2026 03:59
@binaryaaron binaryaaron added the bug Something isn't working label Mar 24, 2026
Base automatically changed from binaryaaron/fix-smollm2-bos-token-id to main March 24, 2026 15:43
  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1,
Collaborator

@kendrickb-nvidia kendrickb-nvidia Mar 24, 2026


suggestion: This should have a comment/explanation that the token id here is what's injected by mock_tokenizer and does not reflect the actual token ids of the named model that would be used running NSS.
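The mock extension this refers to might look like the following (a sketch; the real `mock_tokenizer` fixture shape in `tests/llm/test_metadata.py` is assumed):

```python
from unittest.mock import MagicMock


def make_mock_tokenizer():
    # Illustrative sketch of the test fixture, not the real mock_tokenizer.
    tok = MagicMock()
    tok.bos_token = "<|im_start|>"
    # ID 1 is whatever the mock injects -- it does NOT reflect the actual
    # token IDs of the named model that would be used when running NSS.
    tok.convert_tokens_to_ids.side_effect = (
        lambda token: {"<|im_start|>": 1, "<|im_end|>": 2}.get(token, 0)
    )
    return tok


print(make_mock_tokenizer().convert_tokens_to_ids("<|im_start|>"))  # 1
```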

Co-authored-by: Matt Kornfield <mckornfield@gmail.com>
Signed-off-by: Matt Kornfield <mckornfield@gmail.com>
Copilot AI review requested due to automatic review settings March 26, 2026 21:00
Contributor

Copilot AI left a comment


Pull request overview

This PR is a follow-up to the metadata standardization work, fixing a LLMPromptConfig.from_tokenizer fallback bug and aligning tokenizer types/usage across ModelMetadata subclasses to avoid silently missing BOS/EOS configuration and hardcoded special-token IDs.

Changes:

  • Fix LLMPromptConfig.from_tokenizer to use the resolved tokenizer instance for getattr(...) fallbacks.
  • Standardize tokenizer parameter types to PreTrainedTokenizer | None, consistently pass tokenizer= into from_tokenizer, and derive <|im_start|> token IDs dynamically for Llama32/SmolLM2/SmolLM3.
  • Update unit tests’ mock tokenizer to support convert_tokens_to_ids, and adjust documentation about generation special-token output behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `tests/llm/test_metadata.py` | Updates expected BOS token IDs and extends the tokenizer mock to support dynamic ID lookup. |
| `src/nemo_safe_synthesizer/llm/metadata.py` | Fixes the tokenizer fallback bug, unifies tokenizer typing, and removes hardcoded `<\|im_start\|>` IDs. |
| `src/nemo_safe_synthesizer/generation/AGENTS.md` | Updates the generation "Gotchas" docs around special-token outputs and EOS handling. |


  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use

Copilot AI Mar 26, 2026


Typo in the inline comment: "mock_tockenizer" should be "mock_tokenizer" to match the fixture name and avoid confusion when reading the test expectations.

Suggested change
- expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use
+ expected_bos_token_id=1, # this is the token_id that the mock_tokenizer injects, not what a real model would use


- `_resolve_temperature()` — forces `temperature=0.0` when `do_sample=False`; raises if `do_sample=False` and `temperature > 0`.
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`, `ignore_eos=True`. Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
Copy link

Copilot AI Mar 26, 2026


This description of need_special_token_outputs is inaccurate: the stated condition "not TabularDataProcessor ... or not TimeSeriesDataProcessor" would be true for essentially any concrete processor, and it also doesn't match the code. In generation/vllm_backend.py, the condition is need_special_token_outputs = not isinstance(self.processor, TabularDataProcessor), while timeseries_backend.py sets skip_special_tokens=True/include_stop_str_in_output=False unconditionally. Please update this doc line to reflect the actual logic per backend.

Suggested change
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — in `VllmBackend`, this is `True` when the processor is not `TabularDataProcessor`, and then we use `skip_special_tokens=False` and `include_stop_str_in_output=True` (needed e.g. for `GroupedDataProcessor` BOS/EOS). In `TimeseriesBackend`, special tokens are always skipped and stop strings are always excluded (`skip_special_tokens=True`, `include_stop_str_in_output=False`), independent of this flag. `ignore_eos` is always `False` (native EOS stopping).
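The per-backend logic described in the review comment can be sketched like this (stub processor classes and free functions standing in for the real code in `generation/vllm_backend.py` and `generation/timeseries_backend.py`):

```python
class TabularDataProcessor: ...
class GroupedDataProcessor: ...


def vllm_sampling_flags(processor):
    # VllmBackend: special-token outputs are needed for every processor
    # except the tabular one.
    need_special_token_outputs = not isinstance(processor, TabularDataProcessor)
    return {
        "skip_special_tokens": not need_special_token_outputs,
        "include_stop_str_in_output": need_special_token_outputs,
        "ignore_eos": False,  # always rely on native EOS stopping
    }


def timeseries_sampling_flags(processor):
    # TimeseriesBackend: special tokens are always stripped and stop
    # strings excluded, regardless of the processor.
    return {
        "skip_special_tokens": True,
        "include_stop_str_in_output": False,
        "ignore_eos": False,
    }


print(vllm_sampling_flags(GroupedDataProcessor())["skip_special_tokens"])  # False
```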

