Changes in PR #12332 introduced repacking of Q4_K model weights for the AVX2 CPU backend. The repacking during model load is very slow for large models, and it degrades token generation performance on NUMA systems. For some models, such as Command A, it also degrades prompt processing speed. While these slowdowns can be avoided by disabling repacking with the command-line argument -ot .=CPU, such repacking is a problematic default.
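For reference, this is the workaround applied to the Command A benchmark command used below; the only difference is the added -ot argument, which (as I understand it) maps every tensor to the plain CPU buffer type so that no repacking happens:
./llama-cli -m ~/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 32768 --temp 0.5 -ot .=CPU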
Repacking the weights of a medium or large model is frustratingly slow: for Command A in Q4_K_M it takes around 20 minutes. It also seems that the repacked weights end up in RAM attached to only one CPU socket, which results in poor NUMA performance even when using the --numa distribute option.
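For anyone who wants to verify the placement, the per-NUMA-node memory residency of the running process can be inspected with numastat (from the numactl/numastat packages); this is just a generic way to look at it, nothing specific to llama.cpp:
numastat -p $(pgrep llama-cli)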
Here are benchmark results on my dual Xeon 4216 system with Command A and Phi 4, using llama.cpp commit 3d82dbc (tag b4929), which introduced the block-interleaving repacking, and with that commit reverted. Performance with the current commit c262bed (b5043) is nearly identical to b4929. For the benchmarks, I gave two prompts: "Hello" followed by "Tell me a joke".
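To reproduce the "reverted" configuration, something along these lines should work (3d82dbc is the commit tagged b4929, so reverting HEAD at that tag removes only the repacking change; a standard CMake release build is assumed):
git checkout b4929
git revert --no-edit 3d82dbc
cmake -B build
cmake --build build --config Release -j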
Command:
./llama-cli -m ~/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 32768 --temp 0.5
Command A, with repacking/interleaving:
llama_perf_context_print: load time = 1221419.61 ms
llama_perf_context_print: prompt eval time = 5232.90 ms / 17 tokens ( 307.82 ms per token, 3.25 tokens per second)
llama_perf_context_print: eval time = 33647.99 ms / 55 runs ( 611.78 ms per token, 1.63 tokens per second)
Command A, without repacking/interleaving:
llama_perf_context_print: load time = 109457.37 ms
llama_perf_context_print: prompt eval time = 4787.59 ms / 17 tokens ( 281.62 ms per token, 3.55 tokens per second)
llama_perf_context_print: eval time = 32230.27 ms / 62 runs ( 519.84 ms per token, 1.92 tokens per second)
Command:
./llama-cli -m ~/phi-4-abliterated.Q4_K_M.gguf --numa distribute --threads 64 -c 32768 --temp 0.5
Phi 4, with repacking/interleaving:
llama_perf_context_print: load time = 76962.44 ms
llama_perf_context_print: prompt eval time = 822.04 ms / 22 tokens ( 37.37 ms per token, 26.76 tokens per second)
llama_perf_context_print: eval time = 6582.56 ms / 58 runs ( 113.49 ms per token, 8.81 tokens per second)
Phi 4, without repacking/interleaving:
llama_perf_context_print: load time = 17113.48 ms
llama_perf_context_print: prompt eval time = 862.61 ms / 22 tokens ( 39.21 ms per token, 25.50 tokens per second)
llama_perf_context_print: eval time = 6401.67 ms / 61 runs ( 104.95 ms per token, 9.53 tokens per second)
If I avoid NUMA by using only a single socket, the prompt processing and token generation penalties go away, but repacking during model load is still quite slow.
Command:
./llama-cli -m ~/phi-4-abliterated.Q4_K_M.gguf --numa isolate --threads 32 -c 32768 --temp 0.5
Phi 4, with repacking/interleaving:
llama_perf_context_print: load time = 77142.05 ms
llama_perf_context_print: prompt eval time = 1171.54 ms / 22 tokens ( 53.25 ms per token, 18.78 tokens per second)
llama_perf_context_print: eval time = 7431.09 ms / 62 runs ( 119.86 ms per token, 8.34 tokens per second)
Phi 4, without repacking/interleaving:
llama_perf_context_print: load time = 19604.91 ms
llama_perf_context_print: prompt eval time = 1412.99 ms / 22 tokens ( 64.23 ms per token, 15.57 tokens per second)
llama_perf_context_print: eval time = 5974.69 ms / 46 runs ( 129.88 ms per token, 7.70 tokens per second)
Note: I dropped the page cache by writing 3 to /proc/sys/vm/drop_caches between benchmark runs so that the NUMA options could do their job properly.
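Concretely, the cache drop between runs was along these lines (root is needed to write to drop_caches; sync first so dirty pages are flushed):
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches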
Given the low CPU usage (about 16% of one core) that I observed during repacking, my guess is that the repacking work is neither parallelized across threads nor distributed across CPU sockets in accordance with the --numa distribute option. If it were, repacking would presumably be much faster and the inference performance regressions would be avoided.
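For reference, the CPU usage of the load/repack phase can be watched from a second terminal with pidstat (from the sysstat package); top or htop shows essentially the same thing:
pidstat -u -p $(pgrep llama-cli) 5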
Name and Version
llama.cpp commit c262bed (tag b5043)
Operating systems
Linux
GGML backends
CPU
Hardware
2x Intel Xeon 4216 (dual socket)
Models
Command A (c4ai-command-a-03-2025, Q4_K_M) and Phi 4 abliterated (Q4_K_M)
Problem description & steps to reproduce
First Bad Commit
3d82dbc
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)
Relevant log output