fix(llm): consistent tokenizer types and fix from_tokenizer fallback bug #287
binaryaaron wants to merge 3 commits into main
Conversation
SmolLM2 metadata hardcoded `bos_token_id=151644` (copied from Llama32), but SmolLM2 models have a 49152-token vocab where `<|im_start|>` is at ID 1. The assembler (`Example.add_sequence`) prepends `bos_token_id` to every training sequence, so 151644 was passed to `Embedding(49152, 576)`, triggering a CUDA `vectorized_gather_kernel` index-out-of-bounds assert during training. Look up the actual token ID from the tokenizer at init time instead of hardcoding it. Also fix stale AGENTS.md that said `ignore_eos=True` (now always `False`).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
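The fix described in this commit can be sketched as follows. The tokenizer class and `resolve_bos_token_id` helper below are illustrative stand-ins, not the actual project code; only `convert_tokens_to_ids` mirrors the real Hugging Face tokenizer method:

```python
class FakeSmolLM2Tokenizer:
    """Stand-in tokenizer with SmolLM2's actual mapping: <|im_start|> -> 1."""

    _vocab = {"<|im_start|>": 1, "<|im_end|>": 2}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


def resolve_bos_token_id(tokenizer, bos_token: str = "<|im_start|>") -> int:
    # Derive the ID from the tokenizer at init time rather than copying
    # another model's hardcoded value (151644 belongs to Llama32's vocab,
    # which is out of range for SmolLM2's Embedding(49152, 576)).
    return tokenizer.convert_tokens_to_ids(bos_token)


bos_id = resolve_bos_token_id(FakeSmolLM2Tokenizer())
assert bos_id == 1  # in-vocab for a 49152-token embedding; 151644 was not
```

The same lookup works for any chat-template model, which is why the follow-up commit applies it to Llama32 and SmolLM3 as well.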
`from_tokenizer` used the raw `tokenizer` parameter (possibly `None`) instead of the resolved `_tokenizer` local for `getattr` fallbacks, silently returning `None` for BOS/EOS fields when callers omitted `tokenizer=`. Also: all `ModelMetadata` subclasses now use `PreTrainedTokenizer | None` parameter types, pass `tokenizer=` to `from_tokenizer`, and derive `bos_token_id` dynamically for Llama32/SmolLM3 (same bug class as the SmolLM2 fix in the parent PR).

Signed-off-by: aagonzales <aagonzales@nvidia.com>
Made-with: Cursor
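The fallback bug is easy to reproduce in miniature. The function names and signatures below are hypothetical simplifications of `LLMPromptConfig.from_tokenizer`, kept only to show the one-character class of bug: reading the raw parameter instead of the resolved local:

```python
class Tok:
    """Minimal stand-in for a tokenizer exposing a BOS token attribute."""
    bos_token = "<|im_start|>"


def from_tokenizer_buggy(tokenizer=None, default_tokenizer=None):
    _tokenizer = tokenizer if tokenizer is not None else default_tokenizer
    # BUG: getattr reads the raw `tokenizer` parameter, which is None
    # whenever the caller relied on the default; `_tokenizer` goes unused.
    return getattr(tokenizer, "bos_token", None)


def from_tokenizer_fixed(tokenizer=None, default_tokenizer=None):
    _tokenizer = tokenizer if tokenizer is not None else default_tokenizer
    # FIX: fall back through the resolved local instead.
    return getattr(_tokenizer, "bos_token", None)


assert from_tokenizer_buggy(default_tokenizer=Tok()) is None  # silent None
assert from_tokenizer_fixed(default_tokenizer=Tok()) == "<|im_start|>"
```

Because `getattr(..., default)` swallows the `None` receiver without raising, the bug only shows up as missing BOS/EOS configuration downstream, which is why it survived until the type annotations were tightened.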
tests/llm/test_metadata.py
Outdated
```diff
  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1,
```
suggestion: This should have a comment/explanation that the token id here is what's injected by mock_tokenizer and does not reflect the actual token ids of the named model that would be used running NSS.
Co-authored-by: Matt Kornfield <mckornfield@gmail.com> Signed-off-by: Matt Kornfield <mckornfield@gmail.com>
Pull request overview
This PR is a follow-up to the metadata standardization work, fixing an `LLMPromptConfig.from_tokenizer` fallback bug and aligning tokenizer types/usage across `ModelMetadata` subclasses to avoid silently missing BOS/EOS configuration and hardcoded special-token IDs.
Changes:
- Fix `LLMPromptConfig.from_tokenizer` to use the resolved tokenizer instance for `getattr(...)` fallbacks.
- Standardize `tokenizer` parameter types to `PreTrainedTokenizer | None`, consistently pass `tokenizer=` into `from_tokenizer`, and derive `<|im_start|>` token IDs dynamically for Llama32/SmolLM2/SmolLM3.
- Update unit tests' mock tokenizer to support `convert_tokens_to_ids`, and adjust documentation about generation special-token output behavior.
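A plausible shape for the extended test mock is sketched below; the class is hypothetical (the real fixture lives in `tests/llm/test_metadata.py`), but it shows why the tests now expect `bos_token_id=1`: the mock injects its own tiny vocab rather than any real model's IDs:

```python
class MockTokenizer:
    """Hypothetical test double: the only contract the metadata code
    needs is bos/eos token attributes plus convert_tokens_to_ids."""

    bos_token = "<|im_start|>"
    eos_token = "<|im_end|>"
    _vocab = {"<|im_start|>": 1, "<|im_end|>": 2}

    def convert_tokens_to_ids(self, token: str) -> int:
        # Dynamic lookup, mirroring the HF tokenizer API the fix relies on.
        return self._vocab[token]


mock = MockTokenizer()
assert mock.convert_tokens_to_ids(mock.bos_token) == 1  # mock-injected ID
```

Any metadata class that derives its BOS ID via `convert_tokens_to_ids` can then be exercised against this mock without downloading a real tokenizer.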
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/llm/test_metadata.py` | Updates expected BOS token IDs and extends the tokenizer mock to support dynamic ID lookup. |
| `src/nemo_safe_synthesizer/llm/metadata.py` | Fixes the tokenizer fallback bug, unifies tokenizer typing, and removes hardcoded `<\|im_start\|>` token IDs. |
| `src/nemo_safe_synthesizer/generation/AGENTS.md` | Updates the generation "Gotchas" docs around special-token outputs and EOS handling. |
```diff
  expected_add_eos=False,
  expected_bos_token="<|im_start|>",
- expected_bos_token_id=151644,
+ expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use
```
Typo in the inline comment: "mock_tockenizer" should be "mock_tokenizer" to match the fixture name and avoid confusion when reading the test expectations.
```diff
- expected_bos_token_id=1, # this is the token_id that the mock_tockenizer injects, not what a real model would use
+ expected_bos_token_id=1, # this is the token_id that the mock_tokenizer injects, not what a real model would use
```
```diff
  - `_resolve_temperature()` — forces `temperature=0.0` when `do_sample=False`; raises if `do_sample=False` and `temperature > 0`.
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`, `ignore_eos=True`. Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
```
This description of need_special_token_outputs is inaccurate: the stated condition "not TabularDataProcessor ... or not TimeSeriesDataProcessor" would be true for essentially any concrete processor, and it also doesn't match the code. In generation/vllm_backend.py, the condition is need_special_token_outputs = not isinstance(self.processor, TabularDataProcessor), while timeseries_backend.py sets skip_special_tokens=True/include_stop_str_in_output=False unconditionally. Please update this doc line to reflect the actual logic per backend.
```diff
- - `need_special_token_outputs` — `True` when processor is not `TabularDataProcessor` (VllmBackend) or not `TimeSeriesDataProcessor` (TimeseriesBackend). When True: `skip_special_tokens=False`, `include_stop_str_in_output=True`. `ignore_eos` is always `False` (native EOS stopping). Needed for GroupedDataProcessor BOS/EOS.
+ - `need_special_token_outputs` — in `VllmBackend`, this is `True` when the processor is not `TabularDataProcessor`, and then we use `skip_special_tokens=False` and `include_stop_str_in_output=True` (needed e.g. for `GroupedDataProcessor` BOS/EOS). In `TimeseriesBackend`, special tokens are always skipped and stop strings are always excluded (`skip_special_tokens=True`, `include_stop_str_in_output=False`), independent of this flag. `ignore_eos` is always `False` (native EOS stopping).
```
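The per-backend logic described in the review comment can be sketched as below. The processor class names come from the PR discussion; the flag-building functions and return shapes are illustrative, not the actual `vllm_backend.py` / `timeseries_backend.py` code:

```python
class TabularDataProcessor:
    """Stand-in for the tabular processor."""

class GroupedDataProcessor:
    """Stand-in for a processor that needs BOS/EOS in the raw output."""


def vllm_sampling_flags(processor) -> dict:
    # VllmBackend: special-token outputs are needed for every processor
    # except TabularDataProcessor (this is the actual condition per the
    # review comment; the dict shape is illustrative).
    need_special_token_outputs = not isinstance(processor, TabularDataProcessor)
    return {
        "skip_special_tokens": not need_special_token_outputs,
        "include_stop_str_in_output": need_special_token_outputs,
        "ignore_eos": False,  # always False: rely on native EOS stopping
    }


def timeseries_sampling_flags(processor) -> dict:
    # TimeseriesBackend: unconditional, regardless of processor type.
    return {
        "skip_special_tokens": True,
        "include_stop_str_in_output": False,
        "ignore_eos": False,
    }


assert vllm_sampling_flags(GroupedDataProcessor())["skip_special_tokens"] is False
assert vllm_sampling_flags(TabularDataProcessor())["skip_special_tokens"] is True
assert timeseries_sampling_flags(GroupedDataProcessor())["skip_special_tokens"] is True
```

Keeping `ignore_eos=False` everywhere is what lets `include_stop_str_in_output=True` coexist with normal EOS-terminated generation.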
Summary
Follow-up to #286. Fixes two additional issues found while standardizing the metadata module:
- `LLMPromptConfig.from_tokenizer` used the raw `tokenizer` parameter (possibly `None`) instead of the resolved `_tokenizer` local for `getattr` fallbacks -- silently returning `None` for BOS/EOS fields when callers omitted `tokenizer=`
- All `ModelMetadata` subclasses now use `PreTrainedTokenizer | None` parameter types (was a mix of untyped, `AutoTokenizer | None`, and `PreTrainedTokenizer | None`), pass `tokenizer=` to `from_tokenizer`, and derive `bos_token_id` dynamically for Llama32 and SmolLM3 (same bug class as the SmolLM2 fix)
- Removed `tokenizer: AutoTokenizer = ...` body re-annotations in Mistral and Nemotron that shadowed the parameter with a wrong type

Test plan
- `uv run pytest tests/llm/test_metadata.py` -- 52 passed
- `tynav diagnostics src/nemo_safe_synthesizer/llm/metadata.py` -- 0 errors (was 1 `invalid-argument-type` before)
- `make format` -- clean
- `make check` -- clean (only pre-existing `dumb_tests.ipynb` diagnostics)

Made with Cursor