
Conversation

@aijadugar

What does this PR do?

This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.

The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.

This update saves each sub-component into its own subdirectory named after the attribute and adds matching loading logic in the ProcessorMixin save/load methods, so processors with multiple tokenizers can be saved and reloaded without data loss.
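
The core idea, as a minimal sketch (not the exact diff; attributes and the per-attribute objects follow the ProcessorMixin convention, and the load half runs inside from_pretrained):

import os

# Save: give every sub-component its own subdirectory named after the
# attribute, so two tokenizers of the same type no longer collide on filenames.
for attribute_name in self.attributes:  # e.g. ["tokenizer1", "tokenizer2"]
    attribute = getattr(self, attribute_name)
    attribute_save_dir = os.path.join(save_directory, attribute_name)
    os.makedirs(attribute_save_dir, exist_ok=True)
    attribute.save_pretrained(attribute_save_dir)

# Load: if "<directory>/<attribute_name>" exists, load the attribute from its
# own subdirectory instead of from the shared top-level files.
attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
if os.path.isdir(attribute_path):
    attribute = attribute_class.from_pretrained(attribute_path)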

Fixes #41816

Before submitting

I have read the contributor guidelines.

The change was discussed in issue #41816.

I’ve tested the processor save/load logic locally with multiple tokenizers.

No documentation changes were required.

Added/verified tests for multiple sub-tokenizers loading correctly.

Who can review?

Tagging maintainers familiar with processor and tokenizer internals:

@Cyrilvallez

@ArthurZucker

AmitMY commented Oct 24, 2025

Running the example from the issue, I get:

Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                                   ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
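
For context: the NameError means the new save path passes a save_jinja_files variable that is never bound in that scope. A hypothetical two-line sketch of the kind of fix needed (whether the flag comes from **kwargs, and its default, are assumptions):

# Hypothetical: bind the name once, before the per-attribute save loop
save_jinja_files = kwargs.pop("save_jinja_files", True)
attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)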

@aijadugar (Author)

Hi @AmitMY, can you try it once again?

AmitMY commented Oct 24, 2025

Strangely, now the error is:

AttributeError: GemmaTokenizerFast has no attribute to_dict

It would be good if you added a test for your code.

Create tests/utils/test_processor_utils.py

import tempfile

from transformers.testing_utils import TestCasePlus

from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"

            attributes = [
                "tokenizer1",
                "tokenizer2",
            ]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self,
                         tokenizer1: PreTrainedTokenizer,
                         tokenizer2: PreTrainedTokenizer
                         ):
                super().__init__(tokenizer1=tokenizer1,
                                 tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

        processor = OtherProcessor(tokenizer1=tokenizer1,
                                   tokenizer2=tokenizer2)
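        # The two checkpoints resolve to different tokenizer classes, which is
        # what lets the test detect a save/load collision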
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)

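        # If the saved files had collided, both attributes would reload as the same class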
        assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
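
To run just this test (a standard pytest invocation; the path matches the suggested file location):

pytest tests/utils/test_processor_utils.py -k test_processor_loads_separate_attributes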

AmitMY left a comment

Cool! Now I guess you need to make all the other tests pass...

AmitMY left a comment

Look at the diff. Ideally, don't change things that are not relevant.

AmitMY commented Oct 26, 2025

W291 [*] Trailing whitespace
   --> src/transformers/processing_utils.py:658:38
    |
656 |         """
657 |         # shallow copy to avoid deepcopy errors
658 |         output = self.__dict__.copy()  
    |                                      ^^
659 |
660 |         # Get the kwargs in `__init__`.
    |
help: Remove trailing whitespace

F821 Undefined name `save_jinja_files`
   --> src/transformers/processing_utils.py:792:80
    |
790 |                 attribute_save_dir = os.path.join(save_directory, attribute_name)
791 |                 os.makedirs(attribute_save_dir, exist_ok=True)
792 |                 attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
    |                                                                                ^^^^^^^^^^^^^^^^
793 |             elif attribute._auto_class is not None:
794 |                 custom_object_save(attribute, save_directory, config=attribute)
    |

W291 [*] Trailing whitespace
    --> src/transformers/processing_utils.py:1423:68
     |
1421 |                 attribute_class = cls.get_possibly_dynamic_module(class_name)
1422 |
1423 |             # updated loading path for handling multiple tokenizers 
     |                                                                    ^
1424 |             attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
1425 |             if os.path.isdir(attribute_path):
     |
help: Remove trailing whitespace

I001 [*] Import block is un-sorted or un-formatted
 --> tests/test_processor_utils.py:1:1
  |
1 | / import tempfile
2 | |
3 | | from transformers.testing_utils import TestCasePlus
4 | | from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer
  | |___________________________________________________________________________^
  |
help: Organize imports

Found 4 errors.
[*] 3 fixable with the `--fix` option.
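
Three of the four are auto-fixable; the F821 (save_jinja_files) needs the manual fix sketched earlier in this thread. A standard ruff invocation for the fixable ones:

ruff check --fix src/transformers/processing_utils.py tests/test_processor_utils.py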

@aijadugar (Author)

Hello @AmitMY, thanks for your continuous feedback and detailed reports!...

AmitMY commented Oct 26, 2025

Tests still fail
