quantize: add option to automatically choose optimal quant types to reach a bpw target at lowest MSE error #15550
Sharing some like-for-like test results showing that this approach produces, in the majority of cases, better quality models compared to naive quantisation (i.e. simply running a standard quantisation without `--target-bpw`). To reduce the duration of the tests, I chose two small but representative models: Llama-3.2-1B (a "classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (a typical Mixture of Experts). The test protocol for each is:
Llama-3.2-1B results:
Huihui-MoE-1.2B-A0.6B results:
PPL: the smaller the better; ρPPL: the higher the better; KLD: the smaller the better. In bold: best quality.

**Note:** Although these are very encouraging results, more testing with different model architectures and sizes will be required before categorically concluding that this functionality consistently yields higher-quality models. Comments, feedback and, in particular, bug reports are very much welcome.
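For reference, the KLD column measures the Kullback–Leibler divergence between the full-precision model's next-token probability distribution and the quantised model's. A minimal Python sketch of the metric itself (illustrative only, not the llama.cpp implementation):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats between two discrete distributions.

    p: reference probabilities (e.g. from the full-precision model)
    q: approximating probabilities (e.g. from the quantised model)
    Zero iff the distributions match; larger means more quality loss.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.7, 0.2, 0.1]     # hypothetical next-token distribution
quant = [0.6, 0.25, 0.15]  # slightly perturbed by quantisation
assert kl_divergence(ref, ref) == 0.0   # identical -> zero divergence
assert kl_divergence(ref, quant) > 0.0  # any mismatch -> positive
```

In practice the divergence is averaged over many token positions of an evaluation corpus, which is what makes it a more sensitive quality signal than perplexity alone.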
This PR introduces a new option, `--target-bpw`, implementing an optimised quant type selection algorithm that automatically determines per-tensor quantisation types to achieve a target bits-per-weight (bpw) with minimal estimated quality loss.

The `target_bpw_type()` function considers all quantisable tensors (e.g. embedding, output, etc.) unless the `--output-tensor-type`, `--token-embedding-type`, and/or `--tensor-type` options are also used, in which case those take precedence. `--prune-layers` can also be used in the same run, in which case `target_bpw_type()` will skip the pruned layers and only consider the remaining tensors against the total bpw budget.

An imatrix is required for the algorithm to work. If activations are included in the imatrix file, the error estimation will be more accurate. At the time of writing, this is only available by generating the file using #14891 with the `--activation-statistics` and `--output-format gguf` options.

Typical usage:
```sh
llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw 5.18 LLM-Model-F16.gguf BPW-Quantized-Q4_K_M.gguf q4_k_m
```
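The general idea behind budgeted per-tensor type selection can be pictured as a greedy upgrade loop: start every tensor at its cheapest candidate type, then repeatedly spend bits where they buy the largest error reduction, until the budget is exhausted. The following Python sketch is purely illustrative; the function name, the `(bpw, estimated_error)` candidate pairs, and the greedy criterion are assumptions for exposition, not the actual `target_bpw_type()` implementation.

```python
def select_quant_types(tensors, target_bpw):
    """Greedy sketch of quant-type selection under a bpw budget.

    tensors: list of (n_weights, candidates), where candidates is a list
             of (bpw, estimated_error) pairs sorted by ascending bpw.
    Returns one chosen candidate index per tensor such that the
    weight-averaged bpw does not exceed target_bpw.
    """
    choice = [0] * len(tensors)  # start everyone at the cheapest type
    total_weights = sum(n for n, _ in tensors)

    def total_bits(ch):
        return sum(tensors[i][1][c][0] * tensors[i][0] for i, c in enumerate(ch))

    while True:
        best, best_gain = None, 0.0
        for i, (n, cands) in enumerate(tensors):
            c = choice[i]
            if c + 1 >= len(cands):
                continue  # tensor already at its largest candidate
            cur_bpw, cur_err = cands[c]
            nxt_bpw, nxt_err = cands[c + 1]
            extra_bits = (nxt_bpw - cur_bpw) * n
            # Skip upgrades that would blow the overall bpw budget.
            if (total_bits(choice) + extra_bits) / total_weights > target_bpw:
                continue
            gain = (cur_err - nxt_err) / extra_bits  # error removed per bit spent
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return choice  # no affordable upgrade improves things
        choice[best] += 1

# Two hypothetical tensors of 100 weights each, 5.0 bpw budget:
tensors = [(100, [(4.0, 1.0), (5.0, 0.2)]),
           (100, [(4.0, 0.5), (6.0, 0.4)])]
print(select_quant_types(tensors, 5.0))  # -> [1, 0]
```

In the toy run above, upgrading the first tensor removes far more error per extra bit than upgrading the second, so the budget is spent there and the second tensor stays at its cheaper type. The real implementation additionally has to respect per-tensor shape constraints on which quant types are valid and uses imatrix-derived error estimates.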
Special thanks to @ddh0 and @compilade for their contributions during the development of this PR.
This PR is created as a draft until testing is complete.