
Server: Cache position calculation error (#12160) #12161


Merged: 1 commit into ggml-org:master on Mar 5, 2025

Conversation

Clauszy
Contributor

@Clauszy Clauszy commented Mar 3, 2025

Bug in cache reuse: when llama_kv_cache_seq_rm is used during cache reuse, the positions of tokens after head_c have been offset by the kv_shift. If head_c is updated incorrectly or not adjusted to account for that shift, valid tokens can be removed by subsequent operations. Here is a step-by-step explanation of the process:

  1. Initial KV Cache State:

    Cache Tokens:   a b c d e f g h j
    Cell Positions: 0 1 2 3 4 5 6 7 8
    New Tokens:     a b e f h j
    Positions:      0 1 - - - -
    
  2. First Operation:

    • head_p is set to 2, and head_c is also set to 2.
    • The token 'e' is found, so head_c is updated to 4, and n_match is set to 2.
    • kv_shift is set to -2.
    • Tokens from head_p to head_c (positions 2 to 4: tokens 'c', 'd') are removed.
      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 4 5 6 7 8
      
    • The remaining tokens' positions are updated by adding kv_shift (-2):
      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 2 3 4 5 6
      
    • head_p is updated to head_p + n_match (2 + 2 = 4).
    • head_c is updated to head_c + n_match (4 + 2 = 6).
  3. Second Operation:

    • head_p is 4, and head_c is 6.

    • The token 'h' is found, so head_c is updated to 7.

    • Tokens from head_p to head_c (positions 4 to 7: tokens 'g', 'h', 'j') are removed.

      Cache Tokens:   a b c d e f g h j
      Cell Positions: 0 1 - - 2 3 - - -
      
    • After this operation, the valid tokens ('h', 'j') in the cache are removed because their positions have been shifted incorrectly.

This demonstrates how improper handling of kv_shift and head_c updates can lead to the unintended removal of valid tokens in the KV cache.
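
To make the index/position mismatch concrete, below is a minimal standalone C++ sketch that replays the two operations on a plain array of cell positions. The names pos, seq_rm and seq_shift_from are invented for this illustration and do not call the real llama.cpp API: seq_rm mimics removal by position (as llama_kv_cache_seq_rm does), and seq_shift_from mimics a shift applied to every position from head_c onward.

    // Standalone simulation of the walkthrough above (no llama.cpp API involved).
    // pos[i] is the KV position stored in cache cell i; -1 marks a removed cell.
    #include <cstdio>
    #include <vector>

    // Remove every cell whose *position* falls in [p0, p1) -- mirrors how removal
    // selects cells by position, not by cell index.
    static void seq_rm(std::vector<int> & pos, int p0, int p1) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p = -1;
        }
    }

    // Shift every remaining cell whose position is >= p0 (the behavior described above).
    static void seq_shift_from(std::vector<int> & pos, int p0, int shift) {
        for (auto & p : pos) {
            if (p != -1 && p >= p0) p += shift;
        }
    }

    int main() {
        //                      a  b  c  d  e  f  g  h  j
        std::vector<int> pos = {0, 1, 2, 3, 4, 5, 6, 7, 8};

        // First operation: head_p = 2, head_c = 4, n_match = 2, kv_shift = -2.
        seq_rm        (pos, 2, 4);   // drops 'c', 'd'
        seq_shift_from(pos, 4, -2);  // 'e' .. 'j' now sit at positions 2 .. 6
        // head_p -> 4, head_c -> 6

        // Second operation: head_c advances to 7 when 'h' is found.
        // Removing [head_p, head_c) = [4, 7) in *position* space now hits
        // 'g' (4), 'h' (5) and 'j' (6): the valid 'h', 'j' are lost.
        seq_rm(pos, 4, 7);

        for (int p : pos) printf("%2d ", p);  // prints:  0  1 -1 -1  2  3 -1 -1 -1
        printf("\n");
        return 0;
    }

The final output matches the cell-position row shown for the second operation: besides 'g', the still-valid 'h' and 'j' cells are gone.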

The first kv shift offsets the positions of all tokens after head_c. When llama_kv_cache_seq_rm is then called with head_c, it removes valid tokens because their positions have already been offset.
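
For contrast, here is the same simulation with the shift restricted to the matched chunk [head_c, head_c + n_match), which is one way to keep the removal ranges and the stored positions consistent. This is only an illustrative sketch under that assumption, not necessarily the exact change made in this PR, and the helpers are again invented rather than taken from the llama.cpp API.

    // Same walkthrough, but the shift touches only the matched chunk, so the
    // positions of the cells beyond it never drift away from the removal ranges.
    #include <cstdio>
    #include <vector>

    // Remove every cell whose position falls in [p0, p1).
    static void seq_rm(std::vector<int> & pos, int p0, int p1) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p = -1;
        }
    }

    // Shift only the cells whose position falls in [p0, p1).
    static void seq_shift_range(std::vector<int> & pos, int p0, int p1, int shift) {
        for (auto & p : pos) {
            if (p >= p0 && p < p1) p += shift;
        }
    }

    int main() {
        //                      a  b  c  d  e  f  g  h  j
        std::vector<int> pos = {0, 1, 2, 3, 4, 5, 6, 7, 8};

        // First operation: reuse 'e', 'f' (head_p = 2, head_c = 4, kv_shift = -2).
        seq_rm         (pos, 2, 4);      // drop 'c', 'd'
        seq_shift_range(pos, 4, 6, -2);  // 'e', 'f' move to 2, 3; 'g', 'h', 'j' stay at 6, 7, 8

        // Second operation: reuse 'h', 'j' (head_p = 4, head_c = 7, kv_shift = -3).
        seq_rm         (pos, 4, 7);      // removes only 'g' (position 6)
        seq_shift_range(pos, 7, 9, -3);  // 'h', 'j' move to 4, 5

        for (int p : pos) printf("%2d ", p);  // prints:  0  1 -1 -1  2  3 -1  4  5
        printf("\n");
        return 0;
    }

With this bookkeeping the reused 'e', 'f', 'h', 'j' end up at positions 2..5, exactly where the new prompt expects them.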
@Clauszy Clauszy requested a review from ngxson as a code owner March 3, 2025 12:57
@ggerganov
Member

Nice catch!

After this operation, valid tokens('g', 'h') in the cache are removed because their positions have been shifted incorrectly.

This should be "... valid tokens ('h', 'j') ...", correct?

@Clauszy
Contributor Author

Clauszy commented Mar 4, 2025

Yes, the valid tokens are 'h' and 'j'.

@Clauszy
Contributor Author

Clauszy commented Mar 5, 2025

@ggerganov @ngxson Can this commit be merged?

@ggerganov
Member

ggerganov commented Mar 5, 2025

I was giving this branch a run yesterday and it seems to work correctly. I think this bug might have been the reason why I observed that small --cache-reuse values do not work that well.

Here is a small demonstration:

# start server with cache reuse enabled
./bin/llama-server -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf --cache-reuse 1 --port 8010

# process the first message, which contains "The quick brown fox jumps over the lazy dog" as a sub-sequence
curl \
    --request POST --url http://127.0.0.1:8010/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "The quick reaction was followed by something. His brown fox started eating the apples. Wolf jumps over sheep in the forest. She was the lazy one.", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# now reuse the tokens to process the sub-sequence only
curl \
    --request POST --url http://127.0.0.1:8010/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "The quick brown fox jumps over the lazy", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

We expect the second query to generate the token dog by reusing the full prompt from the previous request. On master this fails (i.e. does not generate the dog token), but it works correctly with this PR.

@ggerganov ggerganov merged commit 06a92a1 into ggml-org:master Mar 5, 2025
47 checks passed