
llama : reuse compute graphs #14482

Draft — ggerganov wants to merge 1 commit into master from gg/llama-reuse-graphs

Conversation

@ggerganov ggerganov commented Jul 1, 2025

target #14285

PoC for reusing computation graphs. Works with any batch size and is to a large extent generic.

This functionality requires the ggml_set_rows() operator to be supported (see #14285). To be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. inp_embd, inp_pos, inp_attn, etc.). This PR adds logic that, after constructing all input tensors, compares them with the input tensors used in the previous graph; if they match, we return early from the graph-building function and reuse the previous graph. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new ubatch is compatible.
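For illustration, here is a minimal self-contained sketch of what that comparison could look like. The types below are stand-ins, not the actual llm_graph_result_i / llm_graph_input_i classes, and the exact signatures in the PR may differ:

```cpp
#include <memory>
#include <vector>

// stand-in for llm_graph_input_i: every graph input can compare itself
// against the corresponding input of the previously built graph
struct graph_input {
    virtual ~graph_input() = default;
    virtual bool is_same(const graph_input & other) const = 0;
};

// stand-in for the graph result that owns the input tensors
struct graph_result {
    std::vector<std::unique_ptr<graph_input>> inputs;

    // the previous graph can be reused only if every input matches
    bool is_same(const graph_result & other) const {
        if (inputs.size() != other.inputs.size()) {
            return false;
        }
        for (size_t i = 0; i < inputs.size(); ++i) {
            if (!inputs[i]->is_same(*other.inputs[i])) {
                return false;
            }
        }
        return true;
    }
};
```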

The other change needed is a way to swap the llama_memory_context of all graph inputs, so that the new call to llm_graph_result_i::set_inputs() uses the correct context from the current ubatch. To achieve this, we extend llm_graph_result_i with an update() method:

llama.cpp/src/llama-graph.h

Lines 463 to 469 in f61b0f7

void update(llama_memory_context_i * mctx) override {
    for (auto & input : inputs) {
        input->update(mctx);
    }
}
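On the input side, update() typically only needs to re-point the input at the memory context of the current ubatch. A minimal sketch with stand-in types (the real inputs in llama-graph.h hold additional state):

```cpp
// stand-in for llama_memory_context_i
struct memory_context {};

// stand-in for a concrete graph input (e.g. an attention input)
struct graph_input_attn {
    const memory_context * mctx = nullptr; // context captured when the graph was built

    // swap in the memory context of the current ubatch so that the next
    // set_input() call reads the correct KV-cache state
    void update(const memory_context * mctx_cur) {
        mctx = mctx_cur;
    }
};
```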

There are still some rough edges to polish, but the general idea should be visible.

Note: Technically, we don't even have to construct the input tensors explicitly as currently proposed. We just have to check that the parameters that determine their shapes (such as n_tokens, n_outputs, n_kv, etc.) are the same as last time. But performing the check over the set of input tensors seems one step safer and less error-prone.

Enabled models

  • Llama
  • Qwen 2.5
  • Gemma 3

To enable other models, add the following check when building the compute graph:

llama.cpp/src/llama-model.cpp

Lines 4830 to 4837 in f61b0f7

// if the graph supports reusing, we perform the check after creating all input tensors
// important: make sure that no input tensors are created after this point
if (res_prv && res->is_same(res_prv)) {
    can_reuse = true;
    return;
}

And implement any missing llm_graph_input_i::is_same() methods.
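As a rough illustration of what such an is_same() override might look like, here is a sketch using the same kind of stand-in types as above, with illustrative member names (the PR currently relies on a dynamic cast to the concrete input type; see the TODO list below):

```cpp
#include <cstdint>

// stand-in base class, as in the earlier sketch
struct graph_input {
    virtual ~graph_input() = default;
    virtual bool is_same(const graph_input & other) const = 0;
};

// stand-in for a positions-like input whose shape is determined by n_tokens
struct graph_input_pos : graph_input {
    int64_t n_tokens = 0;

    bool is_same(const graph_input & other) const override {
        // a different input type or a different number of tokens means the
        // previous graph's topology does not match, so it cannot be reused
        const auto * o = dynamic_cast<const graph_input_pos *>(&other);
        return o != nullptr && o->n_tokens == n_tokens;
    }
};
```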

Tests

LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128

Benchmark on M2 Ultra:

LLAMA_SET_ROWS=1 ./scripts/compare-commits.sh gg/kv-cache-use-set-rows gg/llama-reuse-graphs -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/gemma-3-4b/ggml-model-q4_0.gguf -m models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -fa 0,1 -t 1 -r 10
| Model | FA | Test | t/s gg/kv-cache-use-set-rows | t/s gg/llama-reuse-graphs | Speedup |
| --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | No | pp512 | 2558.21 | 2562.49 | 1.00 |
| gemma3 4B Q4_0 | No | tg128 | 116.88 | 120.59 | 1.03 |
| gemma3 4B Q4_0 | Yes | pp512 | 2455.14 | 2460.37 | 1.00 |
| gemma3 4B Q4_0 | Yes | tg128 | 117.09 | 120.85 | 1.03 |
| llama 1B Q8_0 | No | pp512 | 7651.88 | 7658.07 | 1.00 |
| llama 1B Q8_0 | No | tg128 | 259.10 | 266.79 | 1.03 |
| llama 1B Q8_0 | Yes | pp512 | 8034.09 | 8057.90 | 1.00 |
| llama 1B Q8_0 | Yes | tg128 | 276.17 | 283.81 | 1.03 |
| qwen2 1.5B Q4_0 | No | pp512 | 5864.40 | 5864.61 | 1.00 |
| qwen2 1.5B Q4_0 | No | tg128 | 204.23 | 211.84 | 1.04 |
| qwen2 1.5B Q4_0 | Yes | pp512 | 6051.94 | 6070.36 | 1.00 |
| qwen2 1.5B Q4_0 | Yes | tg128 | 213.10 | 221.57 | 1.04 |
| qwen2 1.5B Q8_0 | No | pp512 | 5799.81 | 5832.32 | 1.01 |
| qwen2 1.5B Q8_0 | No | tg128 | 172.15 | 178.75 | 1.04 |
| qwen2 1.5B Q8_0 | Yes | pp512 | 5973.31 | 5984.35 | 1.00 |
| qwen2 1.5B Q8_0 | Yes | tg128 | 178.22 | 184.33 | 1.03 |
| qwen2 3B Q4_0 | No | pp512 | 2907.74 | 2913.05 | 1.00 |
| qwen2 3B Q4_0 | No | tg128 | 140.73 | 146.11 | 1.04 |
| qwen2 3B Q4_0 | Yes | pp512 | 2969.73 | 2976.17 | 1.00 |
| qwen2 3B Q4_0 | Yes | tg128 | 146.36 | 151.68 | 1.04 |
| qwen2 3B Q8_0 | No | pp512 | 2841.67 | 2855.64 | 1.00 |
| qwen2 3B Q8_0 | No | tg128 | 110.94 | 114.28 | 1.03 |
| qwen2 3B Q8_0 | Yes | pp512 | 2911.73 | 2918.46 | 1.00 |
| qwen2 3B Q8_0 | Yes | tg128 | 114.24 | 117.33 | 1.03 |

TODO

  • Clean-up and improve new interfaces and members
  • Avoid graph input dynamic casts in is_same methods?
  • Allow to reuse more models
  • Manual user option to force disable of graph reuse?

@ggerganov ggerganov mentioned this pull request Jul 1, 2025
@rgerganov rgerganov marked this pull request as ready for review July 2, 2025 06:04
@ggerganov ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 2f577c5 to 30b4d4e Compare July 2, 2025 12:49
Base automatically changed from gg/kv-cache-use-set-rows to master July 3, 2025 07:53
@ggerganov ggerganov force-pushed the gg/llama-reuse-graphs branch from f61b0f7 to d9e1781 Compare July 3, 2025 08:00
@gabe-l-hart gabe-l-hart mentioned this pull request Jul 3, 2025
@ggerganov ggerganov marked this pull request as draft July 4, 2025 05:50