
Conversation

@aijadugar

What does this PR do?

This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.

The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.

This update saves each sub-component into its own subdirectory named after the attribute and adds matching loading logic in the ProcessorMixin save/load methods, so processors with multiple tokenizers can be saved and reloaded without data loss.
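
The core idea, as a minimal sketch (not the exact diff; attributes and the per-attribute objects follow the ProcessorMixin convention, and the load half runs inside from_pretrained):

import os

# Save: give every sub-component its own subdirectory named after the
# attribute, so two tokenizers of the same type no longer collide on filenames.
for attribute_name in self.attributes:  # e.g. ["tokenizer1", "tokenizer2"]
    attribute = getattr(self, attribute_name)
    attribute_save_dir = os.path.join(save_directory, attribute_name)
    os.makedirs(attribute_save_dir, exist_ok=True)
    attribute.save_pretrained(attribute_save_dir)

# Load: if "<directory>/<attribute_name>" exists, load the attribute from its
# own subdirectory instead of from the shared top-level files.
attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
if os.path.isdir(attribute_path):
    attribute = attribute_class.from_pretrained(attribute_path)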

Fixes #41816

Before submitting

I have read the contributor guidelines.

The change was discussed in issue #41816.

I’ve tested the processor save/load logic locally with multiple tokenizers.

No documentation changes were required.

Added/verified tests for multiple sub-tokenizers loading correctly.

Who can review?

Tagging maintainers familiar with processor and tokenizer internals:

@Cyrilvallez

@ArthurZucker

AmitMY commented Oct 24, 2025

Running the example from the issue, I get:

Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                                   ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
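
For context: the NameError means the new save path passes a save_jinja_files variable that is never bound in that scope. A hypothetical two-line sketch of the kind of fix needed (whether the flag comes from **kwargs, and its default, are assumptions):

# Hypothetical: bind the name once, before the per-attribute save loop
save_jinja_files = kwargs.pop("save_jinja_files", True)
attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)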

@aijadugar (Author)

Hi @AmitMY, can you try it once again?

AmitMY commented Oct 24, 2025

Strangely, now the error is:

AttributeError: GemmaTokenizerFast has no attribute to_dict

It would be good if you added a test for your code.

Create tests/utils/test_processor_utils.py

import tempfile

from transformers.testing_utils import TestCasePlus

from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"

            attributes = [
                "tokenizer1",
                "tokenizer2",
            ]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self,
                         tokenizer1: PreTrainedTokenizer,
                         tokenizer2: PreTrainedTokenizer
                         ):
                super().__init__(tokenizer1=tokenizer1,
                                 tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

        processor = OtherProcessor(tokenizer1=tokenizer1,
                                   tokenizer2=tokenizer2)
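        # The two checkpoints resolve to different tokenizer classes, which is
        # what lets the test detect a save/load collision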
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)

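        # If the saved files had collided, both attributes would reload as the same class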
        assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
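
To run just this test (a standard pytest invocation; the path matches the suggested file location):

pytest tests/utils/test_processor_utils.py -k test_processor_loads_separate_attributes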

AmitMY left a comment

Cool! Now I guess you need to make all the other tests pass...

AmitMY left a comment

Look at the diff. Ideally, don't change things that are not relevant.

AmitMY commented Oct 26, 2025

W291 [*] Trailing whitespace
   --> src/transformers/processing_utils.py:658:38
    |
656 |         """
657 |         # shallow copy to avoid deepcopy errors
658 |         output = self.__dict__.copy()  
    |                                      ^^
659 |
660 |         # Get the kwargs in `__init__`.
    |
help: Remove trailing whitespace

F821 Undefined name `save_jinja_files`
   --> src/transformers/processing_utils.py:792:80
    |
790 |                 attribute_save_dir = os.path.join(save_directory, attribute_name)
791 |                 os.makedirs(attribute_save_dir, exist_ok=True)
792 |                 attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
    |                                                                                ^^^^^^^^^^^^^^^^
793 |             elif attribute._auto_class is not None:
794 |                 custom_object_save(attribute, save_directory, config=attribute)
    |

W291 [*] Trailing whitespace
    --> src/transformers/processing_utils.py:1423:68
     |
1421 |                 attribute_class = cls.get_possibly_dynamic_module(class_name)
1422 |
1423 |             # updated loading path for handling multiple tokenizers 
     |                                                                    ^
1424 |             attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
1425 |             if os.path.isdir(attribute_path):
     |
help: Remove trailing whitespace

I001 [*] Import block is un-sorted or un-formatted
 --> tests/test_processor_utils.py:1:1
  |
1 | / import tempfile
2 | |
3 | | from transformers.testing_utils import TestCasePlus
4 | | from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer
  | |___________________________________________________________________________^
  |
help: Organize imports

Found 4 errors.
[*] 3 fixable with the `--fix` option.
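
Three of the four are auto-fixable; the F821 (save_jinja_files) needs the manual fix sketched earlier in this thread. A standard ruff invocation for the fixable ones:

ruff check --fix src/transformers/processing_utils.py tests/test_processor_utils.py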

@aijadugar (Author)

Hello @AmitMY, thanks for your continuous feedback and detailed reports!...

AmitMY commented Oct 26, 2025

Tests still fail
