@EAddario EAddario commented Aug 24, 2025

This PR introduces a new option, --target-bpw, implementing a quant-type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The selection algorithm:

  • builds a candidate set of quant types (K or IQ types)
  • for each tensor, simulates quantise→dequantise with every candidate type and estimates the error using a weighted MSE error function. If the imatrix includes activations, a bias penalty term is added to better reflect the forward-pass impact, making the error estimate, and therefore the quant type selection, more accurate
  • filters the candidates down to the Pareto frontier (lowest error for a given size), then, starting from the smallest-bpw mix, upgrades tensors to larger formats in order of best error reduction per added bit, until the global bpw budget is reached
  • returns a map of tensor name → ggml_type overrides, which the main quantisation pass uses. If the minimum achievable bpw already exceeds the target, that minimum mix is returned.
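The steps above can be sketched roughly as follows. This is a hypothetical Python sketch: the candidate structure, error values, and cost model are illustrative and do not mirror the actual target_bpw_type() implementation.

```python
# Hypothetical sketch of the per-tensor type selection described above.
# The candidate structure, error values, and cost model are illustrative;
# they do not mirror the actual target_bpw_type() implementation.

def pareto_frontier(candidates):
    """Keep only candidates for which no other candidate is both smaller
    (lower bpw) and more accurate (lower estimated error)."""
    frontier = []
    for c in sorted(candidates, key=lambda c: (c["bpw"], c["err"])):
        if not frontier or c["err"] < frontier[-1]["err"]:
            frontier.append(c)
    return frontier

def select_types(tensors, target_bpw):
    """tensors: {name: (n_weights, [{"type", "bpw", "err"}, ...])}.
    Start from the smallest mix and greedily upgrade the tensor offering
    the best error reduction per added bit, within the global budget."""
    fronts = {n: pareto_frontier(c) for n, (_, c) in tensors.items()}
    choice = {n: 0 for n in fronts}              # index into each frontier
    total_w = sum(w for w, _ in tensors.values())

    def mix_bpw():
        return sum(tensors[n][0] * fronts[n][choice[n]]["bpw"]
                   for n in fronts) / total_w

    while mix_bpw() < target_bpw:
        cur_bpw, best, best_gain = mix_bpw(), None, 0.0
        for n, i in choice.items():
            if i + 1 >= len(fronts[n]):
                continue                          # already at largest type
            cur, nxt = fronts[n][i], fronts[n][i + 1]
            extra_bits = (nxt["bpw"] - cur["bpw"]) * tensors[n][0]
            if cur_bpw + extra_bits / total_w > target_bpw:
                continue                          # upgrade would overshoot
            gain = (cur["err"] - nxt["err"]) / extra_bits
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break                                 # no affordable upgrade left
        choice[best] += 1
    # if even the smallest mix exceeds the target, that minimum is returned
    return {n: fronts[n][choice[n]]["type"] for n in fronts}
```

Note that if the smallest possible mix already exceeds the target, the loop never runs and that minimum mix is returned, matching the behaviour described above.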

The target_bpw_type() function will consider all quantisable tensors (e.g. embeddings, output, etc.) unless the --output-tensor-type, --token-embedding-type, and/or --tensor-type options are also used, in which case those take precedence.

--prune-layers can also be used in the same run, in which case target_bpw_type() will skip the pruned layers and only count the remaining tensors against the total bpw budget.

An imatrix is required for the algorithm to work. If activations are included in the imatrix file, the error estimation will be more accurate. At the time of writing, this is only available by generating the file using #14891 with --activation-statistics and --output-format gguf options.
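For illustration, an error estimate of this shape (a weighted MSE over the quantise→dequantise difference, plus an optional bias penalty when per-column mean activations are available) could look something like the sketch below. The names and the exact form of the penalty are assumptions, not the PR's actual formula.

```python
# Simplified sketch of the error estimate mentioned above: a weighted MSE
# over the quantise->dequantise difference, plus an optional bias penalty
# when per-column mean activations are available. Names and the exact form
# of the penalty are assumptions, not the PR's actual formula.

def estimate_error(w, w_deq, col_weights, activations=None):
    """w, w_deq: tensor rows before/after quantise->dequantise.
    col_weights: per-column importance from the imatrix.
    activations: optional per-column mean activations."""
    err = 0.0
    zeros = [0.0] * len(col_weights)
    for row, row_deq in zip(w, w_deq):
        bias = 0.0
        for x, xq, cw, a in zip(row, row_deq, col_weights,
                                activations if activations else zeros):
            d = x - xq
            err += cw * d * d                  # importance-weighted MSE term
            bias += d * a                      # shift in this row's output
        err += bias * bias                     # penalise systematic bias
    return err
```

Without activations the estimate reduces to the plain weighted MSE; with them, a systematic shift in a row's output is penalised even when the per-weight errors are individually small.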

Typical usage: llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m

Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.

PR created in draft until testing is completed


netrunnereve commented Aug 25, 2025

This is a very interesting idea and makes me think of video compression. In video we can use a variable bitrate algorithm that allocates more bits to scenes with lots of detail and fewer bits to, say, a still image, all while targeting a preset bitrate.

I'm just thinking out loud here, but maybe in the future we could consider performance as well and automatically juggle error and speed with some sort of slider, like what video encoders offer.

screenshot

@EAddario EAddario changed the title quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error possible quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error Aug 25, 2025
@EAddario EAddario (Contributor, Author) commented:

Sharing some like-for-like test results showing that this approach produces, in the majority of cases, better quality models compared to naive quantisation (i.e. simply running standard llama-quantize with no further optimisations).

To reduce the duration of the tests, I have chosen two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (typical Mixture of Experts).

The test protocol for each is:

  1. Generate Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, IQ4_NL, IQ3_M, and IQ3_S naive quantisations (e.g. llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>)
  2. Determine each naive model's bits per weight (bpw). This can easily be done with python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf
  3. Generate the equivalent quant types by setting --target-bpw to the corresponding bpw values (e.g. llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>)
  4. Calculate quality scores via llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence

Llama-3.2-1B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
|---|---|---|---|---|---|---|---|---|
| IQ3_M | 4.2042 | 4.2058 | 11.21441 | 13.08066 | 97.46% | 94.62% | 0.14661 | 0.29047 |
| IQ3_S | 4.1177 | 4.1191 | 11.41846 | 14.10772 | 97.08% | 93.22% | 0.16744 | 0.36883 |
| IQ4_NL | 4.9535 | 4.9542 | 10.10609 | 9.98096 | 99.19% | 99.41% | 0.04641 | 0.03356 |
| Q3_K_L | 4.6913 | 4.6894 | 10.74840 | 10.30599 | 98.10% | 98.83% | 0.10510 | 0.06738 |
| Q3_K_M | 4.4215 | 4.4184 | 10.97909 | 10.42277 | 97.71% | 98.65% | 0.12602 | 0.07920 |
| Q3_K_S | 4.1033 | 4.1037 | 14.19578 | 12.11165 | 92.80% | 95.92% | 0.37986 | 0.22400 |
| Q4_K_M | 5.1779 | 5.1792 | 10.01618 | 9.88781 | 99.34% | 99.54% | 0.03732 | 0.02654 |
| Q4_K_S | 4.9704 | 4.9762 | 10.06778 | 9.97105 | 99.27% | 99.42% | 0.04243 | 0.03350 |
| Q5_K_M | 5.8499 | 5.8521 | 9.75894 | 9.79049 | 99.80% | 99.73% | 0.01128 | 0.01620 |
| Q5_K_S | 5.7273 | 5.7291 | 9.76039 | 9.79663 | 99.80% | 99.70% | 0.01135 | 0.01757 |
| Q6_K | 6.5639 | 6.5646 | 9.68812 | 9.68277 | 99.91% | 99.94% | 0.00495 | 0.00354 |
| Q8_0 | 8.5013 | 8.486 | 9.65172 | 9.64781 | 99.99% | 99.99% | 0.00050 | 0.00048 |

Huihui-MoE-1.2B-A0.6B results:

| Model | Naive BPW | Target BPW | Naive PPL | PPL | Naive 𝜌PPL | 𝜌PPL | Naive KLD | KLD |
|---|---|---|---|---|---|---|---|---|
| IQ3_M | 3.9173 | 3.9204 | 27.44670 | 30.17950 | 92.93% | 91.42% | 0.53704 | 0.58776 |
| IQ3_S | 3.8207 | 3.8239 | 29.30734 | 32.94412 | 92.20% | 90.00% | 0.52148 | 0.70061 |
| IQ4_NL | 4.5043 | 4.5092 | 19.55229 | 19.71237 | 98.62% | 98.09% | 0.08709 | 0.13948 |
| Q3_K_L | 4.3883 | 4.3923 | 21.48216 | 20.62434 | 96.80% | 98.20% | 0.20565 | 0.12301 |
| Q3_K_M | 4.1221 | 5.0412 | 21.94908 | 18.87276 | 96.43% | 99.20% | 0.23232 | 0.04863 |
| Q3_K_S | 3.8207 | 3.8519 | 26.05622 | 23.81005 | 93.87% | 95.60% | 0.41128 | 0.30752 |
| Q4_K_M | 4.9904 | 5.0412 | 18.91957 | 18.87276 | 99.02% | 99.20% | 0.05888 | 0.04863 |
| Q4_K_S | 4.7793 | 4.7826 | 19.12118 | 19.25212 | 98.89% | 99.02% | 0.06898 | 0.06238 |
| Q5_K_M | 5.7541 | 5.7950 | 18.28129 | 18.31989 | 99.66% | 99.70% | 0.01778 | 0.01531 |
| Q5_K_S | 5.6323 | 5.6342 | 18.38359 | 18.37216 | 99.63% | 99.65% | 0.02013 | 0.01884 |
| Q6_K | 6.5655 | 6.5693 | 18.20380 | 18.19202 | 99.80% | 99.81% | 0.00776 | 0.00725 |
| Q8_0 | 8.5028 | 8.5071 | 18.09292 | 18.08959 | 99.90% | 99.90% | 0.00094 | 0.00090 |

PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better.

Note

Although these are very encouraging results, more testing with different model architectures and sizes will be required before categorically concluding this functionality consistently yields higher quality models.

Comments, feedback, and, in particular, bug reports are very much welcome.
