Target: #14285
PoC for reusing computation graphs. Works with any batch size and is to a large extent generic.
This functionality requires the `ggml_set_rows()` operator to be supported (see #14285). In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. `inp_embd`, `inp_pos`, `inp_attn`, etc.).

This PR adds logic that, after constructing all input tensors, compares them with the input tensors used in the previous graph. If they match, we return early from the graph-building function and reuse the previous graph. For this to work, we must no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new `ubatch` is compatible.
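In pseudo-C++ terms, the reuse check boils down to something like this (all names here are illustrative stand-ins, not the actual `llm_graph_*` classes):

```cpp
#include <memory>
#include <vector>

// Illustrative sketch of the reuse check; the types and names are
// hypothetical stand-ins for the real llm_graph_* classes.
struct graph_input {
    virtual ~graph_input() = default;

    // true when the shape-determining state of the two inputs matches,
    // i.e. the resulting graph topology would be identical
    virtual bool is_same(const graph_input & other) const = 0;
};

struct graph_result {
    std::vector<std::unique_ptr<graph_input>> inputs;

    // the previous graph can be reused iff every input matches
    bool can_reuse(const graph_result & prev) const {
        if (inputs.size() != prev.inputs.size()) {
            return false;
        }
        for (size_t i = 0; i < inputs.size(); ++i) {
            if (!inputs[i]->is_same(*prev.inputs[i])) {
                return false;
            }
        }
        return true;
    }
};
```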
The other change that is needed is a way to swap the `llama_memory_context` of all graph inputs, so that the new call to `llm_graph_result_i::set_inputs()` uses the correct context from the current `ubatch`. To achieve this, we extend `llm_graph_result_i` with an update method:

`llama.cpp/src/llama-graph.h`, lines 463 to 469 at `f61b0f7`
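Conceptually, the hook looks something like this (hypothetical sketch; the actual signature is in the referenced lines):

```cpp
// Hypothetical sketch of the update hook; names are illustrative.
// Before reusing a graph, the memory context of every input is swapped
// to the one of the current ubatch, so that the subsequent set_inputs()
// call reads from the right place.
struct llama_memory_context;

struct graph_result_i {
    virtual ~graph_result_i() = default;

    // called on the reused graph with the memory context of the new ubatch
    virtual void update(const llama_memory_context * mctx) = 0;
};
```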
There are still some rough edges to polish, but the general idea should be visible.
Note: Technically, we don't even have to construct the input tensors explicitly as currently proposed. We just have to check that the parameters that determine their shapes (such as `n_tokens`, `n_outputs`, `n_kv`, etc.) are the same as last time. But performing the check over the set of input tensors seems one step safer and less error-prone.
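For comparison, that lighter-weight variant would amount to comparing a small struct of shape-determining parameters (hypothetical sketch):

```cpp
#include <cstdint>

// Hypothetical sketch of the lighter-weight check: instead of comparing
// the input tensors themselves, compare only the scalar parameters that
// determine their shapes.
struct graph_shape_params {
    uint32_t n_tokens;
    uint32_t n_outputs;
    uint32_t n_kv;

    bool operator==(const graph_shape_params & other) const {
        return n_tokens  == other.n_tokens  &&
               n_outputs == other.n_outputs &&
               n_kv      == other.n_kv;
    }
};
```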
Enabled models

To enable other models, add the following check when building the compute graph:
`llama.cpp/src/llama-model.cpp`, lines 4830 to 4837 at `f61b0f7`
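Since the embedded snippet does not render here, the check has roughly the following shape (hypothetical paraphrase, reusing the illustrative `graph_result` type from the sketch above; `res`/`res_prev` are the current and previous graph results):

```cpp
// Hypothetical paraphrase of the early-out inside a model's graph-build
// function: once all inputs are constructed, compare against the result
// kept from the previous ubatch and bail out early on a match.
if (res_prev && res->can_reuse(*res_prev)) {
    return res_prev; // topology unchanged: reuse the previous graph
}
```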
And implement any missing `llm_graph_input_i::is_same()` methods.
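For a simple input, such a method can reduce to a shape comparison of the produced tensor, e.g. via `ggml_are_same_shape()` (hypothetical sketch reusing the illustrative `graph_input` base from above, not the actual override):

```cpp
#include "ggml.h"

// Hypothetical input class whose only shape-determining state is the
// positions tensor it produces; names are illustrative.
struct input_pos : graph_input {
    ggml_tensor * pos = nullptr;

    bool is_same(const graph_input & other) const override {
        const auto * o = dynamic_cast<const input_pos *>(&other);
        // interchangeable when the produced tensors have the same shape
        return o && ggml_are_same_shape(pos, o->pos);
    }
};
```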
Tests

```
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128
```
Benchmark on M2 Ultra:
TODO

- `is_same` methods?