System Info
Moving from transformers 4.57.3 to 5.0+ produces a different, and seemingly incorrect, tokenization with the same tokenizer.
I believe the new version is incorrect because it leads to bad results: the model starts to introduce unexpected artifacts in its responses.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run the following script under both versions and compare the tokenized prompts.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/MiniMax-M2.1-4bit")

messages = [
    {
        "role": "system",
        "content": '"You are opencode, an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.\n\nIMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes.'
    }
]

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=False
)
print(prompt)
```

Expected behavior
The tokenized prompts should be identical between the two versions.
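To pinpoint where the two tokenizations diverge, a small self-contained helper like the one below can compare the two outputs. The token ID lists here are hypothetical placeholders for illustration, not actual output from either transformers version:

```python
def first_divergence(a, b):
    """Return the index of the first position where two token ID
    sequences differ; None if they are identical."""
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return i
    # One sequence is a strict prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))

# Hypothetical token IDs standing in for the 4.57.3 and 5.0+ outputs.
prompt_v4 = [200000, 1474, 3056, 11]
prompt_v5 = [200000, 1474, 9981, 11]
print(first_divergence(prompt_v4, prompt_v5))  # → 2
```

Decoding the tokens around the divergence index with `tokenizer.decode` should then reveal which part of the rendered chat template differs between the two versions.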