Description
System Info
GPU: B300
OS: Ubuntu 24.04 + CUDA 12.8
Transformers version: built from source on the latest main branch
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
To reproduce the bug:
Run the script (I ran it on a B300 GPU) on the main branch:
test_before.py
The error log file:
before_fix.log
File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3636, in forward
hidden_states = self.input_embed(
^^^^^^^^^^^^^^^^^
File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2975, in forward
hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 30000 but got size 32000 for tensor number 2 in the list.
The issue is that noise_initialization is created with a hardcoded time length of 30000:
noise_initialization = torch.randn([1, 30000, self.mel_dim], dtype=reference_mel_spectrogram.dtype)
maximum_duration = quantized_code.shape[1] * self.repeats
initial_state = noise_initialization[:, :maximum_duration].to(quantized_code.device)
When the quantized_code time length exceeds 30000, the slice silently caps initial_state at 30000 frames. Later, in the DiTInputEmbedding forward pass, the tensors are concatenated along the feature dimension, which requires their time lengths to match:
hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
Concatenating a tensor of shape (1, 30000, hidden_dim) with one of shape (1, 65536, code_dim) then triggers the shape mismatch error.
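The mismatch can be reproduced in isolation. The sketch below uses hypothetical dimensions standing in for mel_dim, the code embedding size, and self.repeats; only the 30000-frame hardcoded buffer mirrors the actual code:

```python
import torch

# Hypothetical stand-ins for mel_dim, code embedding dim, and self.repeats
mel_dim, code_dim, repeats = 80, 512, 2

# The noise buffer is pre-allocated with a hardcoded time length of 30000
noise_initialization = torch.randn(1, 30000, mel_dim)

# With a long enough code sequence, the requested duration exceeds the buffer
quantized_code_len = 16000
maximum_duration = quantized_code_len * repeats             # 32000
hidden_states = noise_initialization[:, :maximum_duration]  # silently capped at 30000
code_embed = torch.randn(1, maximum_duration, code_dim)     # still 32000 frames

# torch.cat along dim=-1 requires every other dimension to match,
# so the differing time lengths (30000 vs 32000) raise a RuntimeError
try:
    torch.cat((hidden_states, code_embed), dim=-1)
    failed = False
except RuntimeError as err:
    failed = True
    print(err)
```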
Expected behavior
noise_initialization shouldn't be hardcoded; instead, its length should be an inference-time argument provided by the user. And when the time lengths mismatch, the application can cap and align the time dimension so audio generation can continue.
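One possible alignment is to cap both tensors to the shorter time length before concatenating. This is only a sketch of the idea, not necessarily what the fix PR implements:

```python
import torch

# Hypothetical dimensions for illustration
mel_dim, code_dim = 80, 512
hidden_states = torch.randn(1, 30000, mel_dim)  # capped noise buffer
code_embed = torch.randn(1, 32000, code_dim)    # longer code embedding

# Align both tensors to the shorter time length, then concatenate features
t = min(hidden_states.shape[1], code_embed.shape[1])
out = torch.cat((hidden_states[:, :t], code_embed[:, :t]), dim=-1)
print(out.shape)  # (1, 30000, mel_dim + code_dim)
```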
The fix PR is at: #43068