
Eval bug: Weight repacking for AVX2 block interleaving is very slow and NUMA unfriendly #12759


Open
sultanqasim opened this issue Apr 4, 2025 · 1 comment

Comments

@sultanqasim

Name and Version

$ ./llama-cli --version
version: 5043 (c262bedd)
built with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux

Operating systems

Linux

GGML backends

CPU

Hardware

2x Intel Xeon 4216 (dual socket)

Models

Problem description & steps to reproduce

PR #12332 introduced repacking of Q4_K model weights for the AVX2 CPU backend. This repacking during model load is very slow for large models, and it degrades token generation performance on NUMA systems. For some models, such as Command A, it also degrades prompt processing speed. These slowdowns can be avoided by disabling repacking with the command line argument -ot .=CPU (see the example below), but such repacking is a problematic default.
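For reference, disabling repacking just means appending -ot .=CPU (override every tensor to the standard CPU buffer type) to the usual command, for example:

Command (workaround, no repacking):
./llama-cli -m ~/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 32768 --temp 0.5 -ot .=CPU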

Repacking the weights of medium and large models is frustratingly slow: for Command A in Q4_K_M it takes around 20 minutes. It also appears that the repacked weights are stored in RAM belonging to only one CPU socket, which results in poor NUMA performance even when using the --numa distribute option.

Here are benchmark results on my dual Xeon 4216 system with Command A and Phi 4, using llama.cpp commit 3d82dbc (tag b4929, the commit that introduced repacking for block interleaving) and with that commit reverted. Performance with the current commit c262bed (b5043) is nearly identical to b4929. For the benchmarks, I gave two prompts: "Hello" followed by "Tell me a joke".

Command:
./llama-cli -m ~/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 32768 --temp 0.5

Command A, with repacking/interleaving:
llama_perf_context_print:        load time = 1221419.61 ms
llama_perf_context_print: prompt eval time =    5232.90 ms /    17 tokens (  307.82 ms per token,     3.25 tokens per second)
llama_perf_context_print:        eval time =   33647.99 ms /    55 runs   (  611.78 ms per token,     1.63 tokens per second)

Command A, without repacking/interleaving:
llama_perf_context_print:        load time =  109457.37 ms
llama_perf_context_print: prompt eval time =    4787.59 ms /    17 tokens (  281.62 ms per token,     3.55 tokens per second)
llama_perf_context_print:        eval time =   32230.27 ms /    62 runs   (  519.84 ms per token,     1.92 tokens per second)
Command:
./llama-cli -m ~/phi-4-abliterated.Q4_K_M.gguf --numa distribute --threads 64 -c 32768 --temp 0.5

Phi 4, with repacking/interleaving:
llama_perf_context_print:        load time =   76962.44 ms
llama_perf_context_print: prompt eval time =     822.04 ms /    22 tokens (   37.37 ms per token,    26.76 tokens per second)
llama_perf_context_print:        eval time =    6582.56 ms /    58 runs   (  113.49 ms per token,     8.81 tokens per second)

Phi 4, without repacking/interleaving:
llama_perf_context_print:        load time =   17113.48 ms
llama_perf_context_print: prompt eval time =     862.61 ms /    22 tokens (   39.21 ms per token,    25.50 tokens per second)
llama_perf_context_print:        eval time =    6401.67 ms /    61 runs   (  104.95 ms per token,     9.53 tokens per second)

If I avoid NUMA by only using a single socket, the prompt processing and token generation performance penalty goes away, but repacking during model load is still quite slow.

Command:
./llama-cli -m ~/phi-4-abliterated.Q4_K_M.gguf --numa isolate --threads 32 -c 32768 --temp 0.5

Phi 4, with repacking/interleaving:
llama_perf_context_print:        load time =   77142.05 ms
llama_perf_context_print: prompt eval time =    1171.54 ms /    22 tokens (   53.25 ms per token,    18.78 tokens per second)
llama_perf_context_print:        eval time =    7431.09 ms /    62 runs   (  119.86 ms per token,     8.34 tokens per second)

Phi 4, without repacking/interleaving:
llama_perf_context_print:        load time =   19604.91 ms
llama_perf_context_print: prompt eval time =    1412.99 ms /    22 tokens (   64.23 ms per token,    15.57 tokens per second)
llama_perf_context_print:        eval time =    5974.69 ms /    46 runs   (  129.88 ms per token,     7.70 tokens per second)

Note: I dropped the page cache by writing 3 to /proc/sys/vm/drop_caches between benchmark runs so that the NUMA options could do their job properly.

First Bad Commit

3d82dbc

ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)

Relevant log output

[sultan@wailer bin]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[sultan@wailer bin]$ ./llama-cli --version
version: 5043 (c262bedd)
built with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
[sultan@wailer bin]$ ./llama-cli -m ~/phi-4-abliterated.Q4_K_M.gguf --numa isolate --threads 32 -c 32768 --temp 0.5
build: 5043 (c262bedd) with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
main: llama backend init
/proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 36 key-value pairs and 243 tensors from /home/sultan/phi-4-abliterated.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 4 Abliterated
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/huihui-ai/phi-...
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Phi 4
llama_model_loader: - kv   8:               general.base_model.0.version str              = 4
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Microsoft
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/microsoft/phi-4
llama_model_loader: - kv  11:                               general.tags arr[str,9]       = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                        phi3.context_length u32              = 16384
llama_model_loader: - kv  14:  phi3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  15:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv  16:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv  17:                           phi3.block_count u32              = 40
llama_model_loader: - kv  18:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv  19:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv  20:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                        phi3.rope.freq_base f32              = 250000.000000
llama_model_loader: - kv  23:              phi3.attention.sliding_window u32              = 0
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 100265
llama_model_loader: - kv  31:            tokenizer.ggml.unknown_token_id u32              = 5809
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 100351
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  101 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.43 GiB (4.94 BPW) 
load: special tokens cache size = 96
load: token to piece cache size = 0.6151 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 10
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1280
print_info: n_embd_v_gqa     = 1280
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 17920
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 250000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.66 B
print_info: general.name     = Phi 4 Abliterated
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|endoftext|>'
print_info: EOS token        = 100265 '<|im_end|>'
print_info: EOT token        = 100257 '<|endoftext|>'
print_info: UNK token        = 5809 '�'
print_info: PAD token        = 100351 '<|dummy_87|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: EOG token        = 100257 '<|endoftext|>'
print_info: EOG token        = 100265 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  8531.89 MiB
load_tensors:  CPU_AARCH64 model buffer size =  5484.38 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 250000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_pre_seq (32768) > n_ctx_train (16384) -- possible training context overflow
llama_context:        CPU  output buffer size =     0.38 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
init:        CPU KV buffer size =  6400.00 MiB
llama_context: KV self size  = 6400.00 MiB, K (f16): 3200.00 MiB, V (f16): 3200.00 MiB
llama_context:        CPU compute buffer size =  2669.01 MiB
llama_context: graph nodes  = 1686
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 32
main: model was trained on only 16384 context tokens (32768 specified)
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>How are you?<|im_end|><|im_start|>assistant<|im_sep|>

system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 4093984923
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 32768
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.500
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hello
Hello! How can I assist you today?

> Tell me a joke
Sure! Here's a classic one for you:

Why don't scientists trust atoms?

Because they make up everything! 😄

I hope that brought a smile to your face! If you'd like to hear another joke, just let me know.

> 
llama_perf_sampler_print:    sampling time =       7.77 ms /    63 runs   (    0.12 ms per token,  8109.15 tokens per second)
llama_perf_context_print:        load time =   77165.47 ms
llama_perf_context_print: prompt eval time =    1180.28 ms /    22 tokens (   53.65 ms per token,    18.64 tokens per second)
llama_perf_context_print:        eval time =    7133.47 ms /    59 runs   (  120.91 ms per token,     8.27 tokens per second)
llama_perf_context_print:       total time =   29961.81 ms /    81 tokens
Interrupted by user
@sultanqasim
Author

Given the low CPU usage (about 16% of one core) that I observed during the repacking process, my guess is that repacking isn't being parallelized or distributed across CPU sockets in accordance with the --numa distribute command line argument. If weight repacking were appropriately parallelized and its output distributed across the sockets per the NUMA options, I expect repacking would be fast and the inference performance regressions would be avoided.
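To sketch the idea (this is only an illustration, not llama.cpp's actual repacking path, and repack_block below is a placeholder for the real Q4_K block interleaving): each NUMA node could repack its own slice of the blocks into memory allocated on that node via libnuma, so every socket ends up reading weights from its local RAM.

// Hypothetical sketch, compile with -lnuma; not the actual llama.cpp repacking code.
#include <numa.h>      // numa_num_configured_nodes, numa_alloc_onnode, numa_run_on_node
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Placeholder for the per-block repack step (real code would interleave Q4_K blocks).
static void repack_block(const uint8_t *src, uint8_t *dst, size_t block_bytes) {
    std::memcpy(dst, src, block_bytes);
}

// Repack n_blocks blocks in parallel, one worker per NUMA node, each writing into
// a destination buffer allocated on its own node.
void repack_numa_parallel(const uint8_t *src, size_t n_blocks, size_t block_bytes,
                          std::vector<uint8_t *> &node_buffers) {
    const int n_nodes = numa_num_configured_nodes();
    std::vector<std::thread> workers;
    for (int node = 0; node < n_nodes; ++node) {
        // Give each node an equal contiguous slice of the blocks.
        const size_t begin = n_blocks * (size_t)  node      / n_nodes;
        const size_t end   = n_blocks * (size_t) (node + 1) / n_nodes;
        // Allocate the destination on the node whose cores will later read it.
        uint8_t *dst = (uint8_t *) numa_alloc_onnode((end - begin) * block_bytes, node);
        node_buffers.push_back(dst);
        workers.emplace_back([=]() {
            numa_run_on_node(node);  // pin this worker's execution to the node's CPUs
            for (size_t b = begin; b < end; ++b) {
                repack_block(src + b * block_bytes, dst + (b - begin) * block_bytes, block_bytes);
            }
        });
    }
    for (auto &t : workers) {
        t.join();
    }
}

Even ignoring NUMA placement, simply splitting the repack loop across all available threads should recover most of the load-time difference.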
