System Info
transformers: 5.2.0 vs 5.4.0 (I could not test 5.3.0 due to a bug)
PyTorch: 2.12
flash-attn: 2.8.3
CUDA: 13.2
Python: 3.12
cudnn: 9.20.0.48
Who can help?
@zucchini-nlp
@Cyrilvallez
Information
Tasks
Reproduction
- Install the latest version of transformers
- Install FlashAttention2
- Run the provided code
- Downgrade to transformers version 5.2.0
- Run the code again
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Thinking",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Thinking")

# Batch of two conversations, each with a dummy all-zero video
# (768 frames of 3x512x512) and a short text prompt.
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": torch.zeros(768, 3, 512, 512)},
                {"type": "text", "text": "placeholder"},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": torch.zeros(768, 3, 512, 512)},
                {"type": "text", "text": "different placeholder"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    do_sample_frames=False,
    return_dict=True,
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to("cuda")

MAX_LEN = 16 * 1024

# Time generate() repeatedly; stop manually with Ctrl+C.
while True:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    # min_length == max_length forces every run to produce exactly MAX_LEN tokens.
    out_ids = model.generate(
        **inputs, do_sample=False, min_length=MAX_LEN, max_length=MAX_LEN, use_cache=True
    )
    end.record()
    torch.cuda.synchronize()
    assert out_ids.size(0) == 2 and out_ids.size(1) == MAX_LEN
    # elapsed_time() returns milliseconds.
    cur_time = start.elapsed_time(end)
    print(cur_time)
Expected behavior
Tested on an NVIDIA RTX 6000 PRO:
- v5.2.0 FA2 = 92 sec
- v5.4.0 FA2 = 108 sec (+17% overhead)
- v5.2.0 SDPA = 136 sec
- v5.4.0 SDPA = 136 sec (no change)
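For reference, here is a small sketch of how the percentage above can be derived from the raw timings. The helper name `relative_change` is only illustrative and not part of any library.

```python
def relative_change(baseline_s: float, candidate_s: float) -> float:
    """Fractional change of candidate relative to baseline (0.17 means 17% slower)."""
    return (candidate_s - baseline_s) / baseline_s


# Timings in seconds, copied from the list above: (v5.2.0, v5.4.0).
timings = {
    "FA2": (92.0, 108.0),
    "SDPA": (136.0, 136.0),
}

for backend, (v520, v540) in timings.items():
    print(f"{backend}: {relative_change(v520, v540):+.1%}")
```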
The regression seems to appear only once the KV cache is large enough that decoding is bound by GPU memory bandwidth. It also seems to be related to the changes to the FA2 integration layer, since the SDPA timings are identical between the two versions. It does not seem to be related to the nvidia-cudnn conv kernel performance regression, which was fixed in a recent PyTorch version.
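To give a rough sense of how large the KV cache gets in this reproduction, here is a back-of-the-envelope sketch. The layer count, number of KV heads, and head dimension below are assumptions I have not verified for this checkpoint; substitute the values from the model's config. The helper `kv_cache_bytes` is hypothetical, and the estimate ignores any extra buffers the cache implementation allocates.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # bfloat16
) -> int:
    """Approximate size of a dense KV cache at full sequence length."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# ASSUMED architecture values (not checked against the checkpoint's config).
size = kv_cache_bytes(
    num_layers=28,
    num_kv_heads=8,
    head_dim=128,
    seq_len=16 * 1024,  # MAX_LEN from the reproduction script
    batch_size=2,
)
print(f"approx. KV cache at full length: {size / 2**30:.2f} GiB")
```

Each decoding step has to read the whole cache once, so at these sizes the per-token cost is dominated by memory traffic rather than compute.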