Different tokenization with same tokenizer from 4.57.3 to 5.0 #43122

@awni

System Info

Upgrading from transformers 4.57.3 to 5.0+ produces a different, and seemingly incorrect, tokenization with the same tokenizer.

I believe the new version is the incorrect one because generation quality degrades with it: the model starts introducing unexpected artifacts into its responses.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the following with the two versions and compare the tokenized prompt.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")
messages = [
    {
        "role": "system",
        "content": '"You are opencode, an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.\n\nIMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes.'
    }
]

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=False
)
print(prompt)
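
For anyone triaging this, one way to compare the two runs is to dump the token IDs under each version, then diff the files and decode the IDs one by one to see where the splits change. Rough sketch (the filename pattern is just illustrative):

import json

from transformers import AutoTokenizer, __version__

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")
messages = [{"role": "system", "content": "..."}]  # same system message as in the repro above

token_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=False
)

# Dump the IDs so the runs under 4.57.3 and 5.0+ can be diffed offline.
with open(f"tokens-{__version__}.json", "w") as f:
    json.dump(token_ids, f)

# Decoding each ID on its own shows exactly where the splits start to differ.
print([tokenizer.decode([t]) for t in token_ids])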

Expected behavior

The tokenized prompts should be identical across both versions.
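
It may also help to check whether the divergence comes from the chat-template rendering or from encoding the rendered string. A rough sketch that splits the two stages (same model and messages as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")
messages = [{"role": "system", "content": "..."}]  # same system message as in the repro above

# Stage 1: render the chat template to a plain string without tokenizing.
rendered = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
print(repr(rendered))

# Stage 2: encode the rendered string directly.
print(tokenizer(rendered, add_special_tokens=False)["input_ids"])

If the rendered strings match across versions but the token IDs differ, the regression is in the tokenizer itself; if the strings already differ, it is in the chat-template handling.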
