@EAddario EAddario commented Aug 24, 2025

This PR introduces a new option, --target-bpw, implementing a quant-type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm:

  • builds a candidate set of quant types (K or IQ types)
  • for each tensor, simulates quantise→dequantise with every candidate type and estimates the error using a weighted MSE error function. If the imatrix includes activations, a bias penalty term is added to better reflect the forward-pass impact, making the error estimate, and therefore the quant type selection, more accurate
  • filters the candidates down to the Pareto frontier (lowest error for a given size), then, starting from the smallest-bpw mix, upgrades tensors to larger formats in order of best error reduction per added bit, until the global bpw budget is reached
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass uses. If the minimum achievable bpw already exceeds the target, that minimum mix is returned.
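The steps above can be sketched roughly as follows. This is a hypothetical Python sketch: the candidate structure, error values, and cost model are illustrative and do not mirror the actual target_bpw_type() implementation.

```python
# Hypothetical sketch of the per-tensor type selection described above.
# The candidate structure, error values, and cost model are illustrative;
# they do not mirror the actual target_bpw_type() implementation.

def pareto_frontier(candidates):
    """Keep only candidates for which no other candidate is both smaller
    (lower bpw) and more accurate (lower estimated error)."""
    frontier = []
    for c in sorted(candidates, key=lambda c: (c["bpw"], c["err"])):
        if not frontier or c["err"] < frontier[-1]["err"]:
            frontier.append(c)
    return frontier

def select_types(tensors, target_bpw):
    """tensors: {name: (n_weights, [{"type", "bpw", "err"}, ...])}.
    Start from the smallest mix and greedily upgrade the tensor offering
    the best error reduction per added bit, within the global budget."""
    fronts = {n: pareto_frontier(c) for n, (_, c) in tensors.items()}
    choice = {n: 0 for n in fronts}              # index into each frontier
    total_w = sum(w for w, _ in tensors.values())

    def mix_bpw():
        return sum(tensors[n][0] * fronts[n][choice[n]]["bpw"]
                   for n in fronts) / total_w

    while mix_bpw() < target_bpw:
        cur_bpw, best, best_gain = mix_bpw(), None, 0.0
        for n, i in choice.items():
            if i + 1 >= len(fronts[n]):
                continue                          # already at largest type
            cur, nxt = fronts[n][i], fronts[n][i + 1]
            extra_bits = (nxt["bpw"] - cur["bpw"]) * tensors[n][0]
            if cur_bpw + extra_bits / total_w > target_bpw:
                continue                          # upgrade would overshoot
            gain = (cur["err"] - nxt["err"]) / extra_bits
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break                                 # no affordable upgrade left
        choice[best] += 1
    # if even the smallest mix exceeds the target, that minimum is returned
    return {n: fronts[n][choice[n]]["type"] for n in fronts}
```

Note that if the smallest possible mix already exceeds the target, the loop never runs and that minimum mix is returned, matching the behaviour described above.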

The target_bpw_type() function will consider all quantisable tensors (e.g. embeddings, output, etc.) unless the --output-tensor-type, --token-embedding-type, and/or --tensor-type options are also used, in which case those take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only count the remaining tensors against the total bpw budget.

An imatrix is required for the algorithm to work. If activations are included in the imatrix file, the error estimation will be more accurate. At the time of writing, this is only available by generating the file using #14891 with --activation-statistics and --output-format gguf options.
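For illustration, an error estimate of this shape (a weighted MSE over the quantise→dequantise difference, plus an optional bias penalty when per-column mean activations are available) could look something like the sketch below. The names and the exact form of the penalty are assumptions, not the PR's actual formula.

```python
# Simplified sketch of the error estimate mentioned above: a weighted MSE
# over the quantise->dequantise difference, plus an optional bias penalty
# when per-column mean activations are available. Names and the exact form
# of the penalty are assumptions, not the PR's actual formula.

def estimate_error(w, w_deq, col_weights, activations=None):
    """w, w_deq: tensor rows before/after quantise->dequantise.
    col_weights: per-column importance from the imatrix.
    activations: optional per-column mean activations."""
    err = 0.0
    zeros = [0.0] * len(col_weights)
    for row, row_deq in zip(w, w_deq):
        bias = 0.0
        for x, xq, cw, a in zip(row, row_deq, col_weights,
                                activations if activations else zeros):
            d = x - xq
            err += cw * d * d                  # importance-weighted MSE term
            bias += d * a                      # shift in this row's output
        err += bias * bias                     # penalise systematic bias
    return err
```

Without activations the estimate reduces to the plain weighted MSE; with them, a systematic shift in a row's output is penalised even when the per-weight errors are individually small.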

Typical usage: llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m

Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed


netrunnereve commented Aug 25, 2025

This is a very interesting idea and makes me think of video compression. In video we can use a variable bitrate algorithm that allocates more bits to scenes with lots of detail and fewer bits to, say, a still image, all while targeting a preset bitrate.

I'm just thinking out loud here, but maybe in the future we could consider performance as well and automatically juggle error and speed with some sort of slider, like what video encoders offer.

screenshot

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error possible quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error Aug 25, 2025
@EAddario EAddario (Contributor, Author) commented:

Sharing some like-for-like test results showing that this approach produces, in the majority of cases, better quality models compared to naive quantisation (i.e. simply running standard llama-quantize with no further optimisations).

To reduce the duration of the tests, I have chosen two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (typical Mixture of Experts).

The test protocol for each is:

  1. Generate Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, IQ4_NL, IQ3_M, and IQ3_S naive quantisations (e.g. llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>)
  2. Determine each naive model's bits per weight (bpw). This can easily be done with python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf
  3. Generate the equivalent quant types by setting --target-bpw to the corresponding bpw values (e.g. llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>)
  4. Calculate quality scores via llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence

Llama-3.2-1B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
|---|---|---|---|---|---|---|---|---|
| IQ3_M | 4.2042 | 4.2058 | 11.21441 | 13.08066 | 97.46% | 94.62% | 0.14661 | 0.29047 |
| IQ3_S | 4.1177 | 4.1191 | 11.41846 | 14.10772 | 97.08% | 93.22% | 0.16744 | 0.36883 |
| IQ4_NL | 4.9535 | 4.9542 | 10.10609 | 9.98096 | 99.19% | 99.41% | 0.04641 | 0.03356 |
| Q3_K_L | 4.6913 | 4.6894 | 10.74840 | 10.30599 | 98.10% | 98.83% | 0.10510 | 0.06738 |
| Q3_K_M | 4.4215 | 4.4184 | 10.97909 | 10.42277 | 97.71% | 98.65% | 0.12602 | 0.07920 |
| Q3_K_S | 4.1033 | 4.1037 | 14.19578 | 12.11165 | 92.80% | 95.92% | 0.37986 | 0.22400 |
| Q4_K_M | 5.1779 | 5.1792 | 10.01618 | 9.88781 | 99.34% | 99.54% | 0.03732 | 0.02654 |
| Q4_K_S | 4.9704 | 4.9762 | 10.06778 | 9.97105 | 99.27% | 99.42% | 0.04243 | 0.03350 |
| Q5_K_M | 5.8499 | 5.8521 | 9.75894 | 9.79049 | 99.80% | 99.73% | 0.01128 | 0.01620 |
| Q5_K_S | 5.7273 | 5.7291 | 9.76039 | 9.79663 | 99.80% | 99.70% | 0.01135 | 0.01757 |
| Q6_K | 6.5639 | 6.5646 | 9.68812 | 9.68277 | 99.91% | 99.94% | 0.00495 | 0.00354 |
| Q8_0 | 8.5013 | 8.486 | 9.65172 | 9.64781 | 99.99% | 99.99% | 0.00050 | 0.00048 |

Huihui-MoE-1.2B-A0.6B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
|---|---|---|---|---|---|---|---|---|
| IQ3_M | 3.9173 | 3.9204 | 27.44670 | 30.17950 | 92.93% | 91.42% | 0.53704 | 0.58776 |
| IQ3_S | 3.8207 | 3.8239 | 29.30734 | 32.94412 | 92.20% | 90.00% | 0.52148 | 0.70061 |
| IQ4_NL | 4.5043 | 4.5092 | 19.55229 | 19.71237 | 98.62% | 98.09% | 0.08709 | 0.13948 |
| Q3_K_L | 4.3883 | 4.3923 | 21.48216 | 20.62434 | 96.80% | 98.20% | 0.20565 | 0.12301 |
| Q3_K_M | 4.1221 | 5.0412 | 21.94908 | 18.87276 | 96.43% | 99.20% | 0.23232 | 0.04863 |
| Q3_K_S | 3.8207 | 3.8519 | 26.05622 | 23.81005 | 93.87% | 95.60% | 0.41128 | 0.30752 |
| Q4_K_M | 4.9904 | 5.0412 | 18.91957 | 18.87276 | 99.02% | 99.20% | 0.05888 | 0.04863 |
| Q4_K_S | 4.7793 | 4.7826 | 19.12118 | 19.25212 | 98.89% | 99.02% | 0.06898 | 0.06238 |
| Q5_K_M | 5.7541 | 5.7950 | 18.28129 | 18.31989 | 99.66% | 99.70% | 0.01778 | 0.01531 |
| Q5_K_S | 5.6323 | 5.6342 | 18.38359 | 18.37216 | 99.63% | 99.65% | 0.02013 | 0.01884 |
| Q6_K | 6.5655 | 6.5693 | 18.20380 | 18.19202 | 99.80% | 99.81% | 0.00776 | 0.00725 |
| Q8_0 | 8.5028 | 8.5071 | 18.09292 | 18.08959 | 99.90% | 99.90% | 0.00094 | 0.00090 |

PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better.

Note

Although these are very encouraging results, more testing with different model architectures and sizes will be required before categorically concluding this functionality consistently yields higher quality models.

Comments, feedback, and, in particular, bug reports are very much welcome.
