## What does this PR do?

### Summary

This PR performs a structural refactoring of the Megatron tokenizer subsystem:

- Removes the empty model-wrapper layer (`text/models/`, `vision/models/`) that added indirection with zero functionality.
- Consolidates conversation tokenization into a shared `conversation/` module, deleting the standalone `SFTTokenizer` class and the `MegatronTokenizerChatTemplate` mixin.
- Introduces a new CLI API (`--tokenizer-library`, `--tokenizer-mode`, `--tokenizer-prompt-format`) that replaces the monolithic `--tokenizer-type` flag, with full backward compatibility.

### Motivation
The previous tokenizer codebase had several architectural problems:
- **Empty model wrappers:** `GPTTokenizer`, `BertTokenizer`, `MambaTokenizer`, `T5Tokenizer`, `DefaultTokenizerText`, and `DefaultTokenizerVision` were all trivial subclasses that set `class_name`/`class_path` in metadata and did nothing else. They added a routing layer (`TOKENIZER_MAPPING_NAMES` + `importlib` dynamic lookup) that obscured the actual code path.
- **SFT was a fake "library":** `SFTTokenizer` was registered as a tokenizer library (`"sft"`) but internally created its own `transformers.AutoTokenizer.from_pretrained()`, completely bypassing the library abstraction. It was not a subclass of `MegatronTokenizerTextAbstract` and could only work with HuggingFace — making `--tokenizer-library sentencepiece --tokenizer-mode sft` impossible.
- **Duplicated conversation tokenization:** The multimodal tokenizer (`multimodal_tokenizer.py`) had its own inline implementation of conversation tokenization (~80 lines) with per-turn masking logic. The SFT tokenizer had a separate but nearly identical implementation. Both duplicated the same pattern: apply the chat template to the full conversation, then iterate per turn to build the target mask. Bugs fixed in one wouldn't be fixed in the other.
- **Chat template as a mixin:** `MegatronTokenizerChatTemplate` was a mixin class in its own file, inherited by `SentencePieceTokenizer` and `TikTokenTokenizer` via multiple inheritance. It added a Jinja2-based `apply_chat_template`, but with a restrictive signature (no `**kwargs`; `chat_template` was positional). The HuggingFace tokenizer had its own completely separate `apply_chat_template` that delegated to HF's native method — with a different signature. This meant conversation tokenization code had to be HF-specific.
- **Hardcoded HF assumptions in conversation code:** `conversation_tokenizer.py` had a parameter named `hf_tokenizer`, used HF-specific kwargs (`return_tensors="np"`, `return_assistant_token_mask=False`), and prompt config factories called `tokenizer.convert_tokens_to_ids()`/`tokenizer.pad_token_id` — all HuggingFace-specific APIs.
- **Monolithic `build_tokenizer`:** The `build_tokenizer()` function was a single 90-line if/elif chain mapping each `--tokenizer-type` value to library-specific kwargs. Adding a new tokenizer type meant extending this chain.

### New Tokenizer Architecture
**Before**

**After**

**Key change:** SFT is no longer a library — it's a capability. Any text tokenizer library gains conversation tokenization when `prompt_format` is provided. The model-wrapper layer is gone entirely.

### What Changed
#### 1. Removed Empty Model Wrappers

Deleted files:

- `text/models/bert_tokenizer.py` (12 lines)
- `text/models/gpt_tokenizer.py` (12 lines)
- `text/models/mamba_tokenizer.py` (12 lines)
- `text/models/t5_tokenizer.py` (12 lines)
- `text/models/default_tokenizer.py` (12 lines)
- `vision/models/default_tokenizer.py` (12 lines)

Each was an empty subclass like:
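A minimal sketch of the wrapper pattern, with `MegatronTokenizerText` stubbed so it runs standalone (the real wrappers subclass the actual base, and the `class_path` value here is illustrative):

```python
# Sketch of the deleted-wrapper pattern: the subclass only stamps metadata.
# MegatronTokenizerText is a stand-in stub here, not the real base class.
class MegatronTokenizerText:
    def __init__(self, config: dict):
        self.config = config

class GPTTokenizer(MegatronTokenizerText):
    """Empty wrapper: its only effect was setting class_name/class_path."""
    def __init__(self, config: dict):
        config.setdefault("class_name", "GPTTokenizer")
        config.setdefault("class_path", "text.models.gpt_tokenizer")
        super().__init__(config)

tok = GPTTokenizer({})
print(tok.config["class_name"])  # -> GPTTokenizer
```

With the wrappers gone, the same `config.setdefault(...)` calls run in `MegatronTokenizerText.__init__()` itself.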
**Resolution:** The `class_name`/`class_path` metadata injection was moved into `MegatronTokenizerText.__init__()` itself via `config.setdefault(...)`. The `TOKENIZER_MAPPING_NAMES` registry and `importlib`-based dynamic dispatch in `megatron_tokenizer.py` were replaced with a direct `TEXT_LIBRARIES`/`VISION_LIBRARIES` check that returns `MegatronTokenizerText` or `MegatronTokenizerVision` directly.

Backward-compat aliases are kept in `text/models/__init__.py` and `vision/models/__init__.py` for any external code that imports by class name.

#### 2. Deleted `SFTTokenizer` — SFT is Now a Capability

Deleted: `text/libraries/sft_tokenizer.py` (254 lines)

`SFTTokenizer` was a standalone class that:

- created its own `transformers.AutoTokenizer.from_pretrained()` internally
- was registered as `"sft"` in `TOKENIZER_MAPPING_LIBRARIES`
- was not a subclass of `MegatronTokenizerTextAbstract`

**Resolution:** SFT conversation tokenization is now a runtime capability of `MegatronTokenizerText`. When `prompt_format` is provided as a kwarg, the tokenizer builds its `_prompt_config` and exposes `tokenize_conversation()`. This means any text library tokenizer (SentencePiece, TikToken, HuggingFace, etc.) can now do SFT conversation tokenization, not just HuggingFace.
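A toy sketch of this capability wiring. Only the names `prompt_format`, `_prompt_config`, `tokenize_conversation`, and `PROMPT_FORMAT_REGISTRY` come from this PR; the factory and the placeholder tokenization logic are purely illustrative:

```python
# Illustrative only: prompt-format factories keyed by name, consulted at init.
PROMPT_FORMAT_REGISTRY = {
    "toy-format": lambda tokenizer: {"pad_token_id": 0},  # stand-in factory
}

class MegatronTokenizerText:
    def __init__(self, prompt_format=None):
        # SFT as a capability: conversation machinery exists only on request.
        self._prompt_config = (
            PROMPT_FORMAT_REGISTRY[prompt_format](self) if prompt_format else None
        )

    def tokenize_conversation(self, conversation):
        if self._prompt_config is None:
            raise ValueError("construct with prompt_format= to tokenize conversations")
        # Placeholder standing in for the real shared implementation.
        return [len(turn["content"]) for turn in conversation]

sft_tok = MegatronTokenizerText(prompt_format="toy-format")
print(sft_tok.tokenize_conversation([{"role": "user", "content": "hi"}]))  # -> [2]
```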
References removed:

- `("sft", "SFTTokenizer")` from `TOKENIZER_MAPPING_LIBRARIES`
- `"sft"` from `TEXT_LIBRARIES`
- the `SFTTokenizer` import from `text/libraries/__init__.py`

#### 3. Deleted `chat_template.py` Mixin — Folded into Abstract Base

Deleted: `text/libraries/chat_template.py` (71 lines)

`MegatronTokenizerChatTemplate` was a mixin class used via multiple inheritance by `SentencePieceTokenizer` and `TikTokenTokenizer`.

**Resolution:** The `apply_chat_template()` implementation was moved directly into `MegatronTokenizerTextAbstract` as a concrete (non-abstract) method. The signature was improved:

- `chat_template` changed from a required positional argument to an optional keyword (falls back to `self.chat_template`)
- `**kwargs` was added to absorb HF-specific kwargs transparently
- tokenization goes through the library's own `text_to_ids()`

`HuggingFaceTokenizer` keeps its own `apply_chat_template` override that delegates to HF's native method.

Additionally, a concrete `token_to_id()` default was added to the abstract base. This ensures all library tokenizers expose a canonical single-token-to-ID method; SP, TikToken, and HF already override it with optimized versions.
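A plausible shape for that default, assuming it delegates to the abstract `text_to_ids()` (the real method may differ in details; `ToyTokenizer` is a stand-in for the concrete libraries):

```python
from abc import ABC, abstractmethod

class MegatronTokenizerTextAbstract(ABC):
    @abstractmethod
    def text_to_ids(self, text: str) -> list:
        ...

    def token_to_id(self, token: str) -> int:
        """Concrete default: encode the token and require exactly one id."""
        ids = self.text_to_ids(token)
        if len(ids) != 1:
            raise ValueError(f"{token!r} does not map to a single id: {ids}")
        return ids[0]

class ToyTokenizer(MegatronTokenizerTextAbstract):
    """Whitespace tokenizer standing in for SP/TikToken/HF."""
    vocab = {"<pad>": 0, "hello": 7}

    def text_to_ids(self, text: str) -> list:
        return [self.vocab[t] for t in text.split()]

print(ToyTokenizer().token_to_id("hello"))  # -> 7
```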
#### 4. Created Shared `conversation/` Module

New module: `megatron/core/tokenizers/conversation/`

- `__init__.py` — exports `tokenize_conversation`, `PROMPT_FORMAT_REGISTRY`, `PromptConfig`
- `conversation_tokenizer.py` — single library-agnostic conversation tokenization implementation
- `prompt_config.py` — `PromptConfig` dataclass, chat template strings, factory registry

This module consolidates:

- SFT conversation tokenization (previously in `SFTTokenizer`)
- multimodal conversation tokenization (previously in `MegatronMultimodalTokenizer.tokenize_conversation`)
- per-format prompt configs (previously inline definitions in `multimodal_tokenizer.py`)

#### 5. De-duplicated Multimodal Conversation Tokenization
**Before:** `multimodal_tokenizer.py` had ~80 lines of inline conversation tokenization plus ~80 lines of per-format `PromptConfig` construction, duplicating the SFT logic.

**After:** `MegatronMultimodalTokenizer.tokenize_conversation()` is now a single call into the shared `conversation/` module. The per-format if/elif chain was replaced with `PROMPT_FORMAT_REGISTRY[prompt_format](tokenizer)`.

#### 6. Made Conversation Tokenization Library-Agnostic

In `conversation_tokenizer.py`:

- the `hf_tokenizer` parameter was renamed to `tokenizer`
- HF-specific kwargs were dropped (`return_tensors="np"`, `return_assistant_token_mask=False`)
- outputs are `np.ndarray` regardless of backend

In `prompt_config.py`, adapter helpers were added. All 14 factory functions use these helpers instead of HF-specific attribute access.
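The PR text doesn't show the helpers themselves; here is a sketch of what library-agnostic adapters could look like. The helper names (`resolve_token_id`, `resolve_pad_id`) and the toy backend are hypothetical — only `convert_tokens_to_ids`, `pad_token_id`, `token_to_id`, and `pad_id` come from the surrounding description:

```python
# Hypothetical adapter helpers: factories call these instead of reaching for
# HuggingFace-only attributes directly.
def resolve_token_id(tokenizer, token: str) -> int:
    """Single token -> id via whichever API the backend exposes."""
    if hasattr(tokenizer, "convert_tokens_to_ids"):  # HuggingFace backend
        return tokenizer.convert_tokens_to_ids(token)
    return tokenizer.token_to_id(token)              # abstract-base default

def resolve_pad_id(tokenizer) -> int:
    """Prefer HF's pad_token_id, else the library tokenizer's pad_id."""
    pad = getattr(tokenizer, "pad_token_id", None)
    return pad if pad is not None else tokenizer.pad_id

class ToySentencePiece:
    """Stand-in for a non-HF backend."""
    pad_id = 0
    def token_to_id(self, token):
        return {"<extra_id_0>": 5}[token]

print(resolve_token_id(ToySentencePiece(), "<extra_id_0>"))  # -> 5
print(resolve_pad_id(ToySentencePiece()))                    # -> 0
```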
#### 7. New CLI API with Backward Compatibility

New flags (in `arguments.py`):

- `--tokenizer-library` — choices: `huggingface`, `sentencepiece`, `tiktoken`, `megatron`, `byte-level`, `null`
- `--tokenizer-mode` — choices: `text`, `sft`, `multimodal` (default: `text`)
- `--tokenizer-prompt-format` — prompt format name for SFT or multimodal

**Backward compatibility:** `--tokenizer-type` is deprecated but fully supported. A mapping table in `validate_args()` converts each legacy type to its `(library, mode)` pair.

**Refactored `build_tokenizer()`:** The 90-line if/elif chain was decomposed into:

- `_resolve_library(args)` — maps library+mode to the internal library string
- `_resolve_tokenizer_path(args)` — resolves the model path
- `_build_library_kwargs(args)` — builds library-specific kwargs
- `_build_mode_kwargs(args)` — builds mode-specific kwargs (SFT prompt format, multimodal settings)

#### 8. Bug Fixes
- **`eod` fallback:** `MegatronTokenizerText.eod` now falls back to `eos_id` when the underlying library tokenizer doesn't define `eod` (e.g., SentencePiece). Previously this would raise `AttributeError`.
- **`pad`/`pad_id` with SFT:** When a `prompt_config` is active, `MegatronTokenizerText.pad` and `pad_id` now return `prompt_config.pad_token_id` instead of the library's default pad. This is critical for SFT masking, where the prompt config defines its own pad token.
- **Infinite recursion in abstract base:** The previous `abstract_tokenizer.py` had property aliases like `@property def cls_id(self): if hasattr(self, 'cls_id'): return self.cls_id`, which would infinitely recurse on any tokenizer that doesn't define `cls_id` elsewhere. These broken aliases were removed.
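The quoted alias can be reproduced in isolation: `hasattr` works by calling `getattr`, which re-enters the same property, so the guard never short-circuits:

```python
# Minimal reproduction of the removed pattern. hasattr(self, 'cls_id') invokes
# this very property, which calls hasattr again, until Python gives up.
class BrokenAliases:
    @property
    def cls_id(self):
        if hasattr(self, 'cls_id'):
            return self.cls_id
        raise AttributeError("cls_id not defined")

try:
    BrokenAliases().cls_id
except RecursionError:
    print("RecursionError, as described")  # -> RecursionError, as described
```

Note that `hasattr` only suppresses `AttributeError`, so the `RecursionError` propagates all the way out instead of making the guard return `False`.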
- **Multimodal `validate_no_image_in_assistant`:** The multimodal conversation tokenization previously had an inline `IMAGE_TOKEN` check only for the assistant role. The shared implementation uses the `validate_no_image_in_assistant` flag on `PromptConfig`, making the behavior explicit and configurable per format.
- **Multimodal `capitalize_roles`:** The nemotron5-aligned format had inline role capitalization in `multimodal_tokenizer.py`. This is now a `PromptConfig.capitalize_roles` flag, handled by the shared conversation tokenizer.

### New User Experience
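For example, an SFT run under the new flags might look like this (illustrative command: the script name, model value, and prompt-format name are placeholders; the flag names and choices are the ones added by this PR):

```shell
# Old style (deprecated, still mapped by validate_args()):
#   --tokenizer-type <LegacyType> --tokenizer-model <model>
# New style: library, mode, and prompt format are independent axes.
python pretrain_gpt.py \
  --tokenizer-library huggingface \
  --tokenizer-mode sft \
  --tokenizer-prompt-format <format-name> \
  --tokenizer-model <model>
```

Because mode is orthogonal to library, swapping `huggingface` for `sentencepiece` here keeps SFT conversation tokenization working — the combination that was previously impossible.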
### Files Changed

| File | Change |
| --- | --- |
| `megatron_tokenizer.py` | Remove `TOKENIZER_MAPPING_NAMES`, remove `"sft"` from `TEXT_LIBRARIES`, replace `importlib` dispatch with direct class resolution |
| `text/text_tokenizer.py` | Add `class_name`/`class_path` defaults, add `_prompt_config` SFT capability, rewrite `tokenize_conversation`, add `pad`/`eod` overrides, remove `("sft", "SFTTokenizer")` |
| `text/libraries/abstract_tokenizer.py` | Add `apply_chat_template` (from mixin), add concrete `token_to_id` default, remove broken property aliases |
| `text/libraries/sentencepiece_tokenizer.py` | Remove `MegatronTokenizerChatTemplate` mixin inheritance |
| `text/libraries/tiktoken_tokenizer.py` | Remove `MegatronTokenizerChatTemplate` mixin inheritance |
| `text/libraries/__init__.py` | Remove `SFTTokenizer` import |
| `text/models/__init__.py` | Keep backward-compat aliases |
| `vision/models/__init__.py` | Replace `DefaultTokenizerVision` import with backward-compat alias |
| `vision/vision_tokenizer.py` | Add `class_name`/`class_path` defaults |
| `vision/libraries/multimodal_tokenizer.py` | Use the shared `conversation/` module |
| `utils/build_tokenizer.py` | Decompose into `_resolve_library`, `_resolve_tokenizer_path`, `_build_library_kwargs`, `_build_mode_kwargs` |
| `training/arguments.py` | Add `--tokenizer-library`, `--tokenizer-mode`, `--tokenizer-prompt-format` flags; add `--tokenizer-type` deprecation mapping |
| `tests/.../test_tokenizer.py` | Update `test_sft_tokenizer` to use `library: huggingface` + `prompt_format` |

### Files Created

| File | Contents |
| --- | --- |
| `conversation/__init__.py` | Exports `tokenize_conversation`, `PROMPT_FORMAT_REGISTRY`, `PromptConfig` |
| `conversation/conversation_tokenizer.py` | Single library-agnostic conversation tokenization implementation |
| `conversation/prompt_config.py` | `PromptConfig` dataclass, chat templates, `PROMPT_FORMAT_REGISTRY`, agnostic helpers |

### Files Deleted
| File | Replaced by |
| --- | --- |
| `text/libraries/sft_tokenizer.py` (254 lines) | A capability of `MegatronTokenizerText`, not a separate library |
| `text/libraries/chat_template.py` (71 lines) | `MegatronTokenizerTextAbstract.apply_chat_template` |
| `text/models/bert_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |
| `text/models/gpt_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |
| `text/models/mamba_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |
| `text/models/t5_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |
| `text/models/default_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |
| `vision/models/default_tokenizer.py` (12 lines) | Empty wrapper; metadata defaults now live in the base class |