[Bug] qwen2_5_omni: Hardcoded noise_initialization length of '30000' causes shape mismatch #43079

@sniper35

Description

System Info

GPU: B300
OS: Ubuntu 24.04 + CUDA 12.8
Transformers version: built from source from the latest main branch

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce the bug, run the following script (I ran it on a B300 GPU) on the main branch:
test_before.py

The error log file:
before_fix.log

  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3636, in forward
    hidden_states = self.input_embed(
                    ^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2975, in forward
    hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 30000 but got size 32000 for tensor number 2 in the list.

The issue is that `noise_initialization` is created with a hardcoded time length of 30000:

        noise_initialization = torch.randn([1, 30000, self.mel_dim], dtype=reference_mel_spectrogram.dtype)
        maximum_duration = quantized_code.shape[1] * self.repeats
        initial_state = noise_initialization[:, :maximum_duration].to(quantized_code.device)

When the time length of `quantized_code` exceeds 30000, the tensors are later concatenated along the feature dimension in the DiTInputEmbedding forward pass:

        hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))

Concatenating a tensor of shape (1, 30000, hidden_dim) with one of shape (1, 32000, code_dim) requires the time dimensions to match, so this raises the shape mismatch error above.
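A minimal sketch of the mismatch, sizing the noise buffer from the actual duration instead of a fixed 30000 (the dimensions and `repeats`/`mel_dim` values here are illustrative stand-ins for the model attributes, not the real config):

```python
import torch

# Hypothetical stand-ins for the model attributes in the snippet above.
mel_dim = 80
repeats = 2
quantized_code = torch.randint(0, 1024, (1, 16000))  # time length 16000

# Instead of a fixed 30000-step buffer, size the noise to the computed
# maximum duration so any code length is covered.
maximum_duration = quantized_code.shape[1] * repeats  # 32000 > 30000
noise_initialization = torch.randn([1, maximum_duration, mel_dim])
initial_state = noise_initialization.to(quantized_code.device)

# initial_state now matches the code-embedding time length, so a later
# torch.cat(..., dim=-1) sees equal sizes in dim 1.
```

With the hardcoded 30000, `initial_state` would have been truncated to 30000 steps while the code embedding kept 32000, which is exactly the size pair in the traceback.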

Expected behavior

`noise_initialization` shouldn't be hardcoded; its length should instead be an inference-time argument provided by the user (or derived from the required duration). And when the time lengths still mismatch, the application can cap and align the time dimension so audio generation can continue.
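The cap-and-align behavior described above can be sketched as follows (a minimal illustration with made-up shapes, not the actual fix in the PR):

```python
import torch

# Illustrative tensors with mismatched time lengths (dim 1).
hidden_states = torch.randn(1, 30000, 64)  # noise-derived states
code_embed = torch.randn(1, 32000, 32)     # code embeddings

# Cap both inputs to the shortest time length so the feature-dim
# concatenation succeeds instead of raising a RuntimeError.
t = min(hidden_states.shape[1], code_embed.shape[1])
merged = torch.cat((hidden_states[:, :t], code_embed[:, :t]), dim=-1)
```

Truncating to the shorter length loses the tail of the longer tensor, so deriving the noise length from the duration (as in the snippet above) is the cleaner primary fix, with capping as a safety net.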

The fix PR is at: #43068
