
Server: Cache position calculation error (#12160) #12161


Merged: 1 commit into ggml-org:master on Mar 5, 2025

Conversation

Clauszy
Contributor

@Clauszy Clauszy commented Mar 3, 2025

Bug in cache reuse: when llama_kv_cache_seq_rm is used during cache reuse, the positions of tokens after head_c have been offset by the kv_shift. If head_c is updated incorrectly or not adjusted to account for that shift, valid tokens can be removed by subsequent operations. Here is a step-by-step explanation of the process:

  1. Initial KV Cache State:

    Cache Tokens:   a b c d e f g h j
    Cell Positions: 0 1 2 3 4 5 6 7 8
    New Tokens:     a b e f h j
    Positions:      0 1 - - - -
    
  2. First Operation:

    • head_p is set to 2, and head_c is also set to 2.
    • The token 'e' is found, so head_c is updated to 4, and n_match is set to 2.
    • kv_shift is set to -2.
    • Tokens from head_p to head_c (positions 2 to 4: tokens 'c', 'd') are removed.
      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 4 5 6 7 8
      
    • The remaining tokens' positions are updated by adding kv_shift (-2):
      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 2 3 4 5 6
      
    • head_p is updated to head_p + n_match (2 + 2 = 4).
    • head_c is updated to head_c + n_match (4 + 2 = 6).
  3. Second Operation:

    • head_p is 4, and head_c is 6.

    • The token 'h' is found, so head_c is updated to 7.

    • Tokens from head_p to head_c (positions 4 to 7: tokens 'g', 'h', 'j') are removed.

      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 2 3 - - -
      
    • After this operation, the valid tokens ('h', 'j') in the cache are removed because their positions have been shifted incorrectly.

This demonstrates how improper handling of kv_shift and head_c updates can lead to the unintended removal of valid tokens in the KV cache.
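
To make the index/position mismatch concrete, below is a minimal standalone C++ sketch that replays the two operations on a plain array of cell positions. The names pos, seq_rm and seq_shift_from are invented for this illustration and do not call the real llama.cpp API: seq_rm mimics removal by position (as llama_kv_cache_seq_rm does), and seq_shift_from mimics a shift applied to every position from head_c onward.

    // Standalone simulation of the walkthrough above (no llama.cpp API involved).
    // pos[i] is the KV position stored in cache cell i; -1 marks a removed cell.
    #include <cstdio>
    #include <vector>

    // Remove every cell whose *position* falls in [p0, p1) -- mirrors how removal
    // selects cells by position, not by cell index.
    static void seq_rm(std::vector<int> & pos, int p0, int p1) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p = -1;
        }
    }

    // Shift every remaining cell whose position is >= p0 (the behavior described above).
    static void seq_shift_from(std::vector<int> & pos, int p0, int shift) {
        for (auto & p : pos) {
            if (p != -1 && p >= p0) p += shift;
        }
    }

    int main() {
        //                      a  b  c  d  e  f  g  h  j
        std::vector<int> pos = {0, 1, 2, 3, 4, 5, 6, 7, 8};

        // First operation: head_p = 2, head_c = 4, n_match = 2, kv_shift = -2.
        seq_rm        (pos, 2, 4);   // drops 'c', 'd'
        seq_shift_from(pos, 4, -2);  // 'e' .. 'j' now sit at positions 2 .. 6
        // head_p -> 4, head_c -> 6

        // Second operation: head_c advances to 7 when 'h' is found.
        // Removing [head_p, head_c) = [4, 7) in *position* space now hits
        // 'g' (4), 'h' (5) and 'j' (6): the valid 'h', 'j' are lost.
        seq_rm(pos, 4, 7);

        for (int p : pos) printf("%2d ", p);  // prints:  0  1 -1 -1  2  3 -1 -1 -1
        printf("\n");
        return 0;
    }

The final output matches the cell-position row shown for the second operation: besides 'g', the still-valid 'h' and 'j' cells are gone.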

The first kv shift offsets the positions of all tokens after head_c. When llama_kv_cache_seq_rm is then called with head_c, it removes valid tokens because their positions have already been offset.
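
For contrast, here is the same simulation with the shift restricted to the matched chunk [head_c, head_c + n_match), which is one way to keep the removal ranges and the stored positions consistent. This is only an illustrative sketch under that assumption, not necessarily the exact change made in this PR, and the helpers are again invented rather than taken from the llama.cpp API.

    // Same walkthrough, but the shift touches only the matched chunk, so the
    // positions of the cells beyond it never drift away from the removal ranges.
    #include <cstdio>
    #include <vector>

    // Remove every cell whose position falls in [p0, p1).
    static void seq_rm(std::vector<int> & pos, int p0, int p1) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p = -1;
        }
    }

    // Shift only the cells whose position falls in [p0, p1).
    static void seq_shift_range(std::vector<int> & pos, int p0, int p1, int shift) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p += shift;
        }
    }

    int main() {
        //                      a  b  c  d  e  f  g  h  j
        std::vector<int> pos = {0, 1, 2, 3, 4, 5, 6, 7, 8};

        // First operation: reuse 'e', 'f' (head_p = 2, head_c = 4, kv_shift = -2).
        seq_rm         (pos, 2, 4);      // drop 'c', 'd'
        seq_shift_range(pos, 4, 6, -2);  // 'e', 'f' move to 2, 3; 'g', 'h', 'j' stay at 6, 7, 8

        // Second operation: reuse 'h', 'j' (head_p = 4, head_c = 7, kv_shift = -3).
        seq_rm         (pos, 4, 7);      // removes only 'g' (position 6)
        seq_shift_range(pos, 7, 9, -3);  // 'h', 'j' move to 4, 5

        for (int p : pos) printf("%2d ", p);  // prints:  0  1 -1 -1  2  3 -1  4  5
        printf("\n");
        return 0;
    }

With this bookkeeping the reused 'e', 'f', 'h', 'j' end up at positions 2..5, exactly where the new prompt expects them.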
@Clauszy Clauszy requested a review from ngxson as a code owner March 3, 2025 12:57
@ggerganov
Member

Nice catch!

After this operation, valid tokens('g', 'h') in the cache are removed because their positions have been shifted incorrectly.

This should be "... valid tokens ('h', 'j') ...", correct?

@Clauszy
Contributor Author

Clauszy commented Mar 4, 2025

Yes, the valid tokens are 'h' and 'j'.

@Clauszy
Contributor Author

Clauszy commented Mar 5, 2025

@ggerganov @ngxson Can this commit be merged?

@ggerganov
Member

ggerganov commented Mar 5, 2025

I was giving this branch a run yesterday and it seems to work correctly. I think this bug might have been the reason why I observed that small --cache-reuse values do not work that well.

Here is a small demonstration:

# start server with cache reuse enabled
./bin/llama-server -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf --cache-reuse 1 --port 8010

# process the first message, which contains "The quick brown fox jumps over the lazy dog" as a sub-sequence
curl \
    --request POST --url http://127.0.0.1:8010/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "The quick reaction was followed by something. His brown fox started eating the apples. Wolf jumps over sheep in the forest. She was the lazy one.", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# now reuse the tokens to process the sub-sequence only
curl \
    --request POST --url http://127.0.0.1:8010/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "The quick brown fox jumps over the lazy", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

We expect the second query to generate the token dog by reusing the full prompt from the previous request. On master this fails (i.e. does not generate the dog token), but it works correctly with this PR.

@ggerganov ggerganov merged commit 06a92a1 into ggml-org:master Mar 5, 2025
47 checks passed