[Bug] qwen2_5_omni: Hardcoded noise_initialization length of '30000' causes shape mismatch #43079

@sniper35

Description

System Info

GPU: B300
OS: Ubuntu 24.04 + CUDA 12.8
Transformers version: built from source from the latest main branch

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce the bug, run the following script (I ran it on a B300 GPU) on the main branch:
test_before.py

The error log file:
before_fix.log

  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3636, in forward
    hidden_states = self.input_embed(
                    ^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2975, in forward
    hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 30000 but got size 32000 for tensor number 2 in the list.

The issue is that `noise_initialization` is created with a hardcoded time length of 30000:

        noise_initialization = torch.randn([1, 30000, self.mel_dim], dtype=reference_mel_spectrogram.dtype)
        maximum_duration = quantized_code.shape[1] * self.repeats
        initial_state = noise_initialization[:, :maximum_duration].to(quantized_code.device)

When the time length of `quantized_code` exceeds 30000, the tensors are later concatenated along the feature dimension in the DiTInputEmbedding forward pass:

        hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))

Concatenating a tensor of shape (1, 30000, hidden_dim) with one of shape (1, 32000, code_dim) requires the time dimensions to match, so this raises the shape mismatch error above.
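A minimal sketch of the mismatch, sizing the noise buffer from the actual duration instead of a fixed 30000 (the dimensions and `repeats`/`mel_dim` values here are illustrative stand-ins for the model attributes, not the real config):

```python
import torch

# Hypothetical stand-ins for the model attributes in the snippet above.
mel_dim = 80
repeats = 2
quantized_code = torch.randint(0, 1024, (1, 16000))  # time length 16000

# Instead of a fixed 30000-step buffer, size the noise to the computed
# maximum duration so any code length is covered.
maximum_duration = quantized_code.shape[1] * repeats  # 32000 > 30000
noise_initialization = torch.randn([1, maximum_duration, mel_dim])
initial_state = noise_initialization.to(quantized_code.device)

# initial_state now matches the code-embedding time length, so a later
# torch.cat(..., dim=-1) sees equal sizes in dim 1.
```

With the hardcoded 30000, `initial_state` would have been truncated to 30000 steps while the code embedding kept 32000, which is exactly the size pair in the traceback.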

Expected behavior

`noise_initialization` shouldn't be hardcoded; its length should instead be an inference-time argument provided by the user (or derived from the required duration). And when the time lengths still mismatch, the application can cap and align the time dimension so audio generation can continue.
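The cap-and-align behavior described above can be sketched as follows (a minimal illustration with made-up shapes, not the actual fix in the PR):

```python
import torch

# Illustrative tensors with mismatched time lengths (dim 1).
hidden_states = torch.randn(1, 30000, 64)  # noise-derived states
code_embed = torch.randn(1, 32000, 32)     # code embeddings

# Cap both inputs to the shortest time length so the feature-dim
# concatenation succeeds instead of raising a RuntimeError.
t = min(hidden_states.shape[1], code_embed.shape[1])
merged = torch.cat((hidden_states[:, :t], code_embed[:, :t]), dim=-1)
```

Truncating to the shorter length loses the tail of the longer tensor, so deriving the noise length from the duration (as in the snippet above) is the cleaner primary fix, with capping as a safety net.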

The fix PR is at: #43068
