
chunked prefill implementation is incorrect #43082

@SmerkyG

Description


System Info

Both the latest and older chunked prefill implementations contain a significant error, most recently in _prefill:
https://github.com/huggingface/transformers/blob/v5.0.0rc1/src/transformers/generation/utils.py#L3849

position_ids is set incorrectly in the chunked prefill codepath, leading to incorrect RoPE application and bad outputs.

                model_kwargs["position_ids"] = model_kwargs["cache_position"].unsqueeze(0)

This sets the position ids based on the chunk being cached, which does not account for all the positions.

Omitting this line fixes the problem, but you may still want to initialize position_ids (or even decoder_position_ids) for some edge-case uses. I was unable to find an initialization that both worked and matched the other initializations elsewhere in the generation utils code.
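For illustration, here is a minimal sketch (not the actual _prefill internals; names and shapes are illustrative) of the invariant chunked prefill has to preserve: each chunk must carry the absolute positions of its tokens within the full prompt, offset by everything already cached, so RoPE is applied consistently across chunks:

```python
import torch

def chunked_prefill_position_ids(prompt_len: int, chunk_size: int):
    """Yield (start, end) bounds and the position ids for each prefill chunk."""
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        # RoPE must see the absolute position of every prompt token; restarting
        # positions at 0 (or otherwise mis-offsetting them) for a later chunk is
        # exactly the failure mode described above.
        position_ids = torch.arange(start, end).unsqueeze(0)  # shape (1, end - start)
        yield (start, end), position_ids
```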

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the LM evaluation harness (or any eval) with prefill_chunk_size specified, even a very large value such as 999999.
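A minimal repro sketch along those lines, assuming prefill_chunk_size is accepted as a generate() kwarg (per the flag referenced above) and using an illustrative small RoPE-based checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # illustrative; any RoPE-based causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The quick brown fox jumps over the lazy", return_tensors="pt")

# Baseline: greedy generation with a regular (unchunked) prefill.
ref = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Chunked prefill: should be identical to the baseline, but diverges
# due to the position_ids bug described above.
chunked = model.generate(
    **inputs, max_new_tokens=20, do_sample=False, prefill_chunk_size=4
)

print(tok.decode(ref[0]))
print(tok.decode(chunked[0]))
assert torch.equal(ref, chunked), "chunked prefill diverges from unchunked baseline"
```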

Expected behavior

Outputs should match a regular, unchunked prefill. Instead, wrong outputs are produced because the position ids are specified incorrectly, leading to the wrong RoPE positions being applied.
