System Info
The chunked prefill implementation, in both the latest release and older ones, contains a significant error, most recently in _prefill:
https://github.com/huggingface/transformers/blob/v5.0.0rc1/src/transformers/generation/utils.py#L3849
position_ids is set incorrectly in the chunked prefill code path, leading to the wrong RoPE rotations being applied and therefore bad outputs:
model_kwargs["position_ids"] = model_kwargs["cache_position"].unsqueeze(0)
This sets the position ids to the cache positions of the chunk being prefilled, which does not account for all of the positions in the sequence.
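To make the mismatch concrete, here is a small toy sketch. The left-padded batch is my assumption about a typical failure mode (it is how eval harnesses usually batch prompts), and the tensors are made up for illustration:

```python
import torch

# Left-padded batch: row 0 has two pad tokens, row 1 has none.
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 1]])
chunk_size = 3
for start in range(0, attention_mask.shape[1], chunk_size):
    # Absolute slots in the KV cache covered by this chunk.
    cache_position = torch.arange(start, start + chunk_size)
    # What the buggy line assigns: one shared row of positions,
    # ignoring each sequence's padding offset.
    print(cache_position.unsqueeze(0))
# Prints [[0, 1, 2]] then [[3, 4, 5]] for both rows, but row 0's real
# tokens occupy per-sequence positions 0..3, so RoPE rotates them as if
# they were at positions 2..5.
```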
Simply omitting this line fixes the problem, but position_ids (or even decoder_position_ids) may still need to be initialized for some edge cases. I was unable to find an initialization that both worked and matched the initialization done elsewhere in the generation utils code.
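For illustration only, here is an untested sketch of the general shape such an initialization could take. The helper name is hypothetical and not part of transformers; the cumsum-over-the-attention-mask derivation mirrors the convention models use when no position ids are passed in:

```python
import torch

def init_chunk_position_ids(attention_mask: torch.Tensor,
                            cache_position: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper, not part of transformers: derive the position
    ids for the current prefill chunk from the full attention mask, so
    that left padding is accounted for."""
    # Per-sequence token index; pad slots are pinned to 1, matching the
    # convention used when models derive position ids themselves.
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    # Keep only the columns belonging to the chunk being prefilled.
    return position_ids[:, cache_position]

mask = torch.tensor([[0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]])
print(init_chunk_position_ids(mask, torch.arange(0, 3)))  # [[1, 1, 0], [0, 1, 2]]
print(init_chunk_position_ids(mask, torch.arange(3, 6)))  # [[1, 2, 3], [3, 4, 5]]
```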
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run the LM evaluation harness, or any other eval, with prefill_chunk_size set; even a value like 999999 reproduces the problem.
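A minimal standalone sketch of the same repro (the checkpoint is just an example, and passing prefill_chunk_size through generate assumes the usual GenerationConfig kwarg forwarding):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # example checkpoint, any causal LM should do
tok = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["The capital of France is", "One plus one equals"]
inputs = tok(prompts, return_tensors="pt", padding=True)

# Greedy decoding, so the two runs should be identical token for token.
baseline = model.generate(**inputs, max_new_tokens=10, do_sample=False)
chunked = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         prefill_chunk_size=4)

for b, c in zip(baseline, chunked):
    print(tok.decode(b, skip_special_tokens=True))
    print(tok.decode(c, skip_special_tokens=True))
```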
Expected behavior
Chunked prefill should produce the same outputs as a regular, unchunked prefill. Instead, it produces wrong outputs because the position ids are derived incorrectly from the cache positions, so the wrong RoPE positions are applied.