System Info
- transformers version: 5.0.0.dev0
- Platform: Linux-6.14.0-1021-gcp-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 1.2.3
- Safetensors version: 0.7.0
- Accelerate version: 1.12.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.0+cu130 (CUDA)
- Using distributed or parallel set-up in script?: No, TP=1
- Using GPU in script?: Yes
- GPU type: NVIDIA RTX PRO 6000 Blackwell Server Edition
Who can help?
@zucchini-nlp @ArthurZucker @itazap
When running vLLM with the above Mistral model, including this vLLM PR for 5.0.0, I get the following error:
(EngineCore_DP0 pid=13140) File "/home/user/vllm-venv/lib/python3.12/site-packages/transformers/models/pixtral/processing_pixtral.py", line 124, in __init__
(EngineCore_DP0 pid=13140) self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)
(EngineCore_DP0 pid=13140) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13140) AttributeError: 'MistralTokenizer' object has no attribute 'convert_tokens_to_ids'. Did you mean: 'convert_tokens_to_string'?
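For reference, the failure can be illustrated outside vLLM with a minimal sketch: any tokenizer object that does not expose convert_tokens_to_ids (as vLLM's MistralTokenizer wrapper does not) should hit the same line in PixtralProcessor.__init__. The FakeMistralTokenizer class below is a hypothetical stand-in, not vLLM's actual class:

```python
# Minimal sketch (assumption: a stand-in object lacking convert_tokens_to_ids
# is enough to trigger the same code path as vLLM's MistralTokenizer wrapper).
from transformers.models.pixtral.processing_pixtral import PixtralProcessor


class FakeMistralTokenizer:
    """Hypothetical stand-in: has convert_tokens_to_string but not
    convert_tokens_to_ids, like the wrapped MistralTokenizer."""

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens)


# Expected to raise the same AttributeError from processing_pixtral.py.
PixtralProcessor(image_processor=None, tokenizer=FakeMistralTokenizer())
```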
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Use case:
- Install:
uv pip install setuptools packaging
uv pip install https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu130
uv pip install datasets psutil pandas torch-c-dlpack-ext
- Run:
VLLM_USE_FLASHINFER_MOE_FP4=1 ENABLE_NVFP4_SM120=1 VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 --tensor-parallel-size=1 --no-enable-prefix-caching --kv-cache-dtype=fp8 --seed=42 --disable-log-requests --download_dir /home/user/models --max-model-len=8192 --max-num-batched-tokens=2048 --max-num-seqs=256 --gpu-memory-utilization=0.97
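For completeness, the installed stack can be checked against the versions listed in System Info above with a quick snippet (a convenience sketch, assuming the packages above are installed in the active environment):

```python
# Sanity-check installed versions against the "System Info" section above.
import torch
import transformers
import vllm

print("transformers:", transformers.__version__)  # expected: 5.0.0.dev0
print("torch:", torch.__version__)                 # expected: 2.9.0+cu130
print("vllm:", vllm.__version__)                   # expected: 0.13.0
print("CUDA available:", torch.cuda.is_available())
```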
Expected behavior
vLLM should start and serve this model correctly, e.g. answering a standard OpenAI-compatible chat request like the sketch below.
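A hedged illustration of what "properly running" would look like once the server starts (assumes vLLM's default port 8000 and the openai client installed; the api_key value is a placeholder):

```python
# Hypothetical success check: query the OpenAI-compatible endpoint that
# "vllm serve" exposes (default URL assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4",
    messages=[{"role": "user", "content": "Describe this model in one sentence."}],
)
print(resp.choices[0].message.content)
```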