
chunked prefill implementation is incorrect #43082

@SmerkyG

Description


System Info

Both the latest and older chunked prefill implementations contain a significant error, most recently in _prefill:
https://github.com/huggingface/transformers/blob/v5.0.0rc1/src/transformers/generation/utils.py#L3849

position_ids is set incorrectly in the chunked prefill codepath, leading to incorrect RoPE application and bad outputs.

                model_kwargs["position_ids"] = model_kwargs["cache_position"].unsqueeze(0)

This sets the position ids based on the chunk being cached, which does not account for all the positions.

Omitting this line fixes the problem, but you may still want to initialize position_ids (or even decoder_position_ids) for some edge-case uses. I was unable to find an initialization that both worked and matched the other initializations elsewhere in the generation utils code.
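For illustration, here is a minimal sketch (not the actual _prefill internals; names and shapes are illustrative) of the invariant chunked prefill has to preserve: each chunk must carry the absolute positions of its tokens within the full prompt, offset by everything already cached, so RoPE is applied consistently across chunks:

```python
import torch

def chunked_prefill_position_ids(prompt_len: int, chunk_size: int):
    """Yield (start, end) bounds and the position ids for each prefill chunk."""
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        # RoPE must see the absolute position of every prompt token; restarting
        # positions at 0 (or otherwise mis-offsetting them) for a later chunk is
        # exactly the failure mode described above.
        position_ids = torch.arange(start, end).unsqueeze(0)  # shape (1, end - start)
        yield (start, end), position_ids
```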

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the LM evaluation harness (or any eval) with prefill_chunk_size specified, even a very large value such as 999999.
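A minimal repro sketch along those lines, assuming prefill_chunk_size is accepted as a generate() kwarg (per the flag referenced above) and using an illustrative small RoPE-based checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # illustrative; any RoPE-based causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The quick brown fox jumps over the lazy", return_tensors="pt")

# Baseline: greedy generation with a regular (unchunked) prefill.
ref = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Chunked prefill: should be identical to the baseline, but diverges
# due to the position_ids bug described above.
chunked = model.generate(
    **inputs, max_new_tokens=20, do_sample=False, prefill_chunk_size=4
)

print(tok.decode(ref[0]))
print(tok.decode(chunked[0]))
assert torch.equal(ref, chunked), "chunked prefill diverges from unchunked baseline"
```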

Expected behavior

Outputs should match a regular, unchunked prefill. Instead, wrong outputs are produced because the position ids are specified incorrectly, leading to the wrong RoPE positions being applied.
