
llama : reuse compute graphs #14482

Draft — ggerganov wants to merge 1 commit into master from gg/llama-reuse-graphs

Conversation

@ggerganov ggerganov commented Jul 1, 2025

target #14285

PoC for reusing computation graphs. Works with any batch size and is to a large extent generic.

This functionality requires the ggml_set_rows() operator to be supported (see #14285). To be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. inp_embd, inp_pos, inp_attn, etc.). This PR adds logic that, after constructing all input tensors, compares them with the input tensors used in the previous graph; if they match, we return early from the graph-building function and reuse the previous graph. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new ubatch is compatible.
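For illustration, here is a minimal self-contained sketch of what that comparison could look like. The types below are stand-ins, not the actual llm_graph_result_i / llm_graph_input_i classes, and the exact signatures in the PR may differ:

```cpp
#include <memory>
#include <vector>

// stand-in for llm_graph_input_i: every graph input can compare itself
// against the corresponding input of the previously built graph
struct graph_input {
    virtual ~graph_input() = default;
    virtual bool is_same(const graph_input & other) const = 0;
};

// stand-in for the graph result that owns the input tensors
struct graph_result {
    std::vector<std::unique_ptr<graph_input>> inputs;

    // the previous graph can be reused only if every input matches
    bool is_same(const graph_result & other) const {
        if (inputs.size() != other.inputs.size()) {
            return false;
        }
        for (size_t i = 0; i < inputs.size(); ++i) {
            if (!inputs[i]->is_same(*other.inputs[i])) {
                return false;
            }
        }
        return true;
    }
};
```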

The other change needed is a way to swap the llama_memory_context of all graph inputs, so that the new call to llm_graph_result_i::set_inputs() uses the correct context from the current ubatch. To achieve this, we extend llm_graph_result_i with an update() method:

llama.cpp/src/llama-graph.h

Lines 463 to 469 in f61b0f7

void update(llama_memory_context_i * mctx) override {
    for (auto & input : inputs) {
        input->update(mctx);
    }
}
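On the input side, update() typically only needs to re-point the input at the memory context of the current ubatch. A minimal sketch with stand-in types (the real inputs in llama-graph.h hold additional state):

```cpp
// stand-in for llama_memory_context_i
struct memory_context {};

// stand-in for a concrete graph input (e.g. an attention input)
struct graph_input_attn {
    const memory_context * mctx = nullptr; // context captured when the graph was built

    // swap in the memory context of the current ubatch so that the next
    // set_input() call reads the correct KV-cache state
    void update(const memory_context * mctx_cur) {
        mctx = mctx_cur;
    }
};
```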

There are still some rough edges to polish, but the general idea should be visible.

Note: Technically, we don't even have to construct the input tensors explicitly as currently proposed. We just have to check that the parameters that determine their shapes (such as n_tokens, n_outputs, n_kv, etc.) are the same as last time. But performing the check over the set of input tensors seems one step safer and less error-prone.

Enabled models

  • Llama
  • Qwen 2.5
  • Gemma 3

To enable other models, add the following check when building the compute graph:

llama.cpp/src/llama-model.cpp

Lines 4830 to 4837 in f61b0f7

// if the graph supports reusing, we perform the check after creating all input tensors
// important: make sure that no input tensors are created after this point
if (res_prv && res->is_same(res_prv)) {
    can_reuse = true;
    return;
}

And implement any missing llm_graph_input_i::is_same() methods.
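As a rough illustration of what such an is_same() override might look like, here is a sketch using the same kind of stand-in types as above, with illustrative member names (the PR currently relies on a dynamic cast to the concrete input type; see the TODO list below):

```cpp
#include <cstdint>

// stand-in base class, as in the earlier sketch
struct graph_input {
    virtual ~graph_input() = default;
    virtual bool is_same(const graph_input & other) const = 0;
};

// stand-in for a positions-like input whose shape is determined by n_tokens
struct graph_input_pos : graph_input {
    int64_t n_tokens = 0;

    bool is_same(const graph_input & other) const override {
        // a different input type or a different number of tokens means the
        // previous graph's topology does not match, so it cannot be reused
        const auto * o = dynamic_cast<const graph_input_pos *>(&other);
        return o != nullptr && o->n_tokens == n_tokens;
    }
};
```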

Tests

LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128

Benchmark on M2 Ultra:

LLAMA_SET_ROWS=1 ./scripts/compare-commits.sh gg/kv-cache-use-set-rows gg/llama-reuse-graphs -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/gemma-3-4b/ggml-model-q4_0.gguf -m models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -fa 0,1 -t 1 -r 10
| Model | FA | Test | t/s gg/kv-cache-use-set-rows | t/s gg/llama-reuse-graphs | Speedup |
| --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | No | pp512 | 2558.21 | 2562.49 | 1.00 |
| gemma3 4B Q4_0 | No | tg128 | 116.88 | 120.59 | 1.03 |
| gemma3 4B Q4_0 | Yes | pp512 | 2455.14 | 2460.37 | 1.00 |
| gemma3 4B Q4_0 | Yes | tg128 | 117.09 | 120.85 | 1.03 |
| llama 1B Q8_0 | No | pp512 | 7651.88 | 7658.07 | 1.00 |
| llama 1B Q8_0 | No | tg128 | 259.10 | 266.79 | 1.03 |
| llama 1B Q8_0 | Yes | pp512 | 8034.09 | 8057.90 | 1.00 |
| llama 1B Q8_0 | Yes | tg128 | 276.17 | 283.81 | 1.03 |
| qwen2 1.5B Q4_0 | No | pp512 | 5864.40 | 5864.61 | 1.00 |
| qwen2 1.5B Q4_0 | No | tg128 | 204.23 | 211.84 | 1.04 |
| qwen2 1.5B Q4_0 | Yes | pp512 | 6051.94 | 6070.36 | 1.00 |
| qwen2 1.5B Q4_0 | Yes | tg128 | 213.10 | 221.57 | 1.04 |
| qwen2 1.5B Q8_0 | No | pp512 | 5799.81 | 5832.32 | 1.01 |
| qwen2 1.5B Q8_0 | No | tg128 | 172.15 | 178.75 | 1.04 |
| qwen2 1.5B Q8_0 | Yes | pp512 | 5973.31 | 5984.35 | 1.00 |
| qwen2 1.5B Q8_0 | Yes | tg128 | 178.22 | 184.33 | 1.03 |
| qwen2 3B Q4_0 | No | pp512 | 2907.74 | 2913.05 | 1.00 |
| qwen2 3B Q4_0 | No | tg128 | 140.73 | 146.11 | 1.04 |
| qwen2 3B Q4_0 | Yes | pp512 | 2969.73 | 2976.17 | 1.00 |
| qwen2 3B Q4_0 | Yes | tg128 | 146.36 | 151.68 | 1.04 |
| qwen2 3B Q8_0 | No | pp512 | 2841.67 | 2855.64 | 1.00 |
| qwen2 3B Q8_0 | No | tg128 | 110.94 | 114.28 | 1.03 |
| qwen2 3B Q8_0 | Yes | pp512 | 2911.73 | 2918.46 | 1.00 |
| qwen2 3B Q8_0 | Yes | tg128 | 114.24 | 117.33 | 1.03 |

TODO

  • Clean-up and improve new interfaces and members
  • Avoid graph input dynamic casts in is_same methods?
  • Allow to reuse more models
  • Manual user option to force disable of graph reuse?

@ggerganov ggerganov mentioned this pull request Jul 1, 2025
@rgerganov rgerganov marked this pull request as ready for review July 2, 2025 06:04
@ggerganov ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 2f577c5 to 30b4d4e Compare July 2, 2025 12:49
Base automatically changed from gg/kv-cache-use-set-rows to master July 3, 2025 07:53
@ggerganov ggerganov force-pushed the gg/llama-reuse-graphs branch from f61b0f7 to d9e1781 Compare July 3, 2025 08:00
@gabe-l-hart gabe-l-hart mentioned this pull request Jul 3, 2025
@ggerganov ggerganov marked this pull request as draft July 4, 2025 05:50