
imatrix: calculate activation-based statistics for new format (GGUF) imatrices #14891


Open
wants to merge 61 commits into master

Conversation

EAddario (Contributor)

Following up from #9400 and #12718, I've started tinkering with activation-based statistics, in addition to what's currently available via --show-statistics.

At the moment, I'm exploring three options, going from easy to implement and an OK approximation, to some assembly required but fairly accurate:

  1. L2 norm of the activation difference: larger values would suggest the tensor has significantly transformed the input with respect to the previous layer (see the sketch after this list).
  2. KL divergence reduction: using an approach similar to the one described by nostalgebraist in logit lens, based on a pre-computed logits file (e.g. from a previous llama-perplexity --save-all-logits run).
  3. Given that llama-imatrix already generates the actual logits to compute PPL, use Thông T. Nguyễn's logit prism approach to calculate the exact contribution of each layer to the final logit scores.
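
A minimal sketch of option 1, assuming per-token activations for consecutive layers are available as numpy arrays (purely illustrative; the actual accumulation would happen inside llama-imatrix):

```python
# Minimal sketch of option 1: per-layer L2 norm of the activation difference.
# Assumes act[l] holds the (n_tokens, d_model) activations after layer l.
import numpy as np

def layer_transform_scores(act: list[np.ndarray]) -> list[float]:
    scores = []
    for prev, curr in zip(act[:-1], act[1:]):
        diff_norms = np.linalg.norm(curr - prev, axis=1)  # per-token L2 norm of the activation difference
        scores.append(float(diff_norms.mean()))           # larger => the layer transforms its input more
    return scores
```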

Sharing with the readers, and in particular @compilade and @jukofyork, in case anyone's willing to double check assumptions and/or suggest alternative approaches I haven't considered.

EAddario marked this pull request as draft on July 26, 2025
jukofyork (Collaborator) commented on Jul 29, 2025

L2 norm of the activation difference: larger values would suggest the tensor has significantly transformed the input with respect to the previous layer.

If we had access to some numerical linear algebra routines then it would likely be possible to get much more interesting stats from this.

If you think about it:

  • The L2 norm of the activation difference is just measuring the Euclidean distance between the tip of the input vector and the tip of the output vector.
  • The mean of these norms probably isn't that interesting (but could be used to test if a quant is systematically biasing or scaling the activations).
  • The variance of these norms is likely much more interesting and tells you about the "richness" of the transformation (indirectly - see below).

If, instead of using the L2 norms of the differences, we construct the cross-covariance matrix of the paired samples and then take the SVD of it:

  • The "richness" of the transformation (measured indirectly above) actually relates to the distribution of the singular values, e.g. there are many sets of activation differences with the same L2 norm, but those with a flat(ter) distribution of singular values (vs a couple of large singular values) are likely to be much more important and interesting.
  • If you convert the SVD into a polar decomposition, then the scaling and rotational components will likely lead to other interesting insights, e.g.:

I suspect that the scaling part of the transformation is quite well handled by the current scalar quants, but the rotational component is likely not.

IIRC, some of the 1-2 bit quants use vector quantization, and if so, these will likely handle the rotational components better and/or show quite different properties.

I'm on my phone ATM so can't easily link them, but there have been several papers showing:

  1. Outlier activations in LLMs matter much more than simple rate–distortion theory would suggest/measure. This is likely related to the "flatness" of the singular values, where only rarely do some singular vector directions give a high dot-product with an input activation, but when they do, they add a significant/important contribution to the output.
  2. LLMs are much more rotational than people first realised, e.g. there was [IIRC] a Microsoft paper where they constrained everything to be on the surface of a unit ball, and there are several PEFT methods that purely alter the rotational directions via orthogonal transformations.
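
To make this concrete, here is a rough numpy sketch of the cross-covariance + SVD statistics described above. X and Y are hypothetical paired per-token input/output activations of shape (n_tokens, d_model); this is an illustration, not llama.cpp code:

```python
# Rough sketch: L2-norm stats plus the cross-covariance / SVD analysis of paired activations.
import numpy as np

def activation_diff_stats(X: np.ndarray, Y: np.ndarray):
    D = Y - X
    l2 = np.linalg.norm(D, axis=1)            # per-token L2 norm of the activation difference
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    C = Xc.T @ Yc / (len(X) - 1)              # cross-covariance matrix of the paired samples
    U, s, Vh = np.linalg.svd(C)
    s_norm = s / s.sum()
    # "Flatness" of the singular value spectrum as normalized entropy (1.0 = perfectly flat).
    flatness = float(-(s_norm * np.log(s_norm + 1e-12)).sum() / np.log(len(s)))
    R = U @ Vh                                # rotational (orthogonal) factor of the polar decomposition C = R S
    return l2.mean(), l2.var(), flatness, R
```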

jukofyork (Collaborator) commented on Jul 29, 2025

If it's of any use, there is code here to analyse the symmetrised cross-covariance matrix I used for the control vectors:

https://github.com/jukofyork/control-vectors/blob/main/direction_analyzer.py

The symmetrised version deliberately gets rid of the rotational components, as they can't be made use of if we are just looking for a single direction... You can actually do the same on the anti-symmetrised version (to look at the rotational components only), but eigendecomposition is less useful for this as it will return all complex vectors (hence why SVD makes more sense).

I should also add that from my experiments using SVD on the tensors (i.e. ignoring the activations!) of LLMs, it often appears that the early/final tensors (which actually appear to be very important and are bumped up in bits by the quant routines here!) tend to have a less flat distribution of singular values themselves! So when you ignore the distribution of input activations, they generally appear to be doing something inherently "lower dimensional" than the middle tensors!? It would be interesting to investigate this whilst also looking at the activations...
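
A minimal illustration of the symmetrised vs anti-symmetrised split (the linked direction_analyzer.py is the actual implementation; the function below is only a sketch, not taken from that file):

```python
# Split a cross-covariance matrix C into a symmetric part (real eigendecomposition,
# "scaling-like") and an anti-symmetric part (rotational; analysed via SVD to avoid
# complex eigenpairs).
import numpy as np

def sym_antisym_spectra(C: np.ndarray):
    S = 0.5 * (C + C.T)                              # symmetrised part
    A = 0.5 * (C - C.T)                              # anti-symmetrised (rotational) part
    eigvals, eigvecs = np.linalg.eigh(S)             # real spectrum of the symmetric part
    sing_vals = np.linalg.svd(A, compute_uv=False)   # singular values of the rotational part
    return eigvals, sing_vals
```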

EAddario (Contributor, Author)

I'd be lying if I were to claim I understand everything in there 🥴, but I think I got the gist.

Implementing the L2 norm seems straightforward without having to introduce additional third-party dependencies, but completely agree that a "light" BLAS lib will be a godsend.

For now, I'll focus on the L2 norm, but will add activation variance as well (good shout!)

For a later version, I'd like to try the logit prism approach but that's for another day.

Thanks for the steer @jukofyork! more weekend reading 😁

jukofyork (Collaborator)


If you want to learn more about Linear Algebra then Gilbert Strang's video lectures are amazing:

https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8

(IIRC, only the first lecture is in bad resolution, so don't be put off by that!)

or if you like books:

https://www.amazon.co.uk/Practical-Linear-Algebra-Textbooks-Mathematics/dp/0367507846

(or one of the earlier editions of this same book)

gives a really solid foundation in terms of 2D and 3D.

The biggest problem with breaking into it is that, for some reason, American universities decided to make it much more abstract and proof-based than it needs to be (probably to weed out potential math majors!).

If you look at some much older pre-1980s books, or books not aimed at Westerners, then it's surprising how approachable it is:

https://mirtitles.org/?s=linear+algebra

jukofyork (Collaborator)

but completely agree that a "light" BLAS lib will be a godsend.

I have tried to bring this up before:

#8831 (reply in thread)
#8831 (comment)

I think it would be fairly straightforward to port the non-complex routines and then open up all that GSL has to offer:

https://www.gnu.org/software/gsl/doc/html/linalg.html

instead of trying to rewrite numerical routines that have had thousands and thousands of hours of thought and testing put into them! :)

EAddario marked this pull request as draft on August 16, 2025
EAddario marked this pull request as ready for review on August 16, 2025
compilade (Collaborator) left a comment

My comments starting with (style) are (potentially subjective) formatting and/or code layout comments, while the (few) other ones are about correctness.

CISC (Collaborator) commented on Aug 16, 2025

Including the activations during file generation will double the size of the imatrix so I have added a new flag --activation-statistics to make it optional.

Doubling the size of the imatrix isn't really that concerning; I would rather have the activation data stored by default. @compilade what's your viewpoint?

EAddario (Contributor, Author) commented on Aug 16, 2025

The imatrix for some of the recent models, especially those with multi-modal support, can be quite big. Kimi-K2-Instruct, for example, is 1.5 GB in legacy format, and GLM-4.5 is almost 700 MB.

Happy to reverse the change (one less option to worry about 😁), but I was trying to be mindful of the user's disk space.

EAddario requested a review from compilade on August 17, 2025
jukofyork (Collaborator)

@EAddario @compilade here is what I failed to post to discord and pastebin:

Idea

If you apply a single permutation of the model (hidden) dimension consistently to every parameter that reads from or writes to that dimension, you get a functionally identical Transformer/LLM (up to floating‑point noise). The permutation must be applied everywhere the d_model axis appears (embeddings, norms, attention and MLP projections, residual outputs, final LM head).

P does not need to satisfy P = P^T. It only needs to be a permutation matrix, which implies P^T = P^{-1}. P = P^T holds only for involutive permutations (made of 1- and 2-cycles) and is not required.

How to transform the weights

Let P ∈ R^{d×d} be a permutation matrix and use the same P across the whole model. Using row‑vector notation (token state x ∈ R^{1×d}):

General rule

  • Any linear map W: hidden → S (e.g., q/k/v, up/gate in MLP):
    W' = P^T W; bias stays the same.
  • Any linear map W: S → hidden (e.g., o_proj, down in MLP, residual projections):
    W' = W P; bias b' = b P.
  • Any linear map W: hidden → hidden:
    W' = P^T W P; bias b' = b P.
  • LayerNorm/RMSNorm parameters on the hidden axis:
    γ' = γ P, β' = β P (their entries are permuted the same way as x).
  • Embeddings that produce hidden states (token, positional):
    E' = E P.
  • Final LM head:
    • If tied to the input embedding, just tie to E' and you're done (because P is orthogonal: P^T = P^{-1}).
    • If untied and logits = x W_out, then W_out' = P^T W_out.

Why this works (sketch)

Inductively assume the hidden state becomes x' = x P. LayerNorm is permutation‑equivariant if you also permute γ, β, so LN(x') = LN(x) P. Then:

  • Attention: Q' = LN(x) P · (P^T W_Q) = LN(x) W_Q = Q (same for K, V), so attention weights and outputs are unchanged; only the post‑projection to hidden becomes H' = H P, preserving the residual x' + H' = (x + H) P.
  • MLP: preactivations are unchanged (W_1' = P^T W_1), and the post‑projection gives the same hidden update multiplied by P (W_2' = W_2 P), preserving the residual.

RoPE and other position encodings

For models using RoPE (Rotary Positional Embedding), this doesn't matter because RoPE is applied after the Q and K projections. Since the projected Q and K values are unchanged in the transformed model (as shown above), RoPE operates on the same Q and K and produces identical results.

Caveats and notes

  • You must apply the same permutation P everywhere the model dimension appears, including all biases on the hidden axis and all normalization parameters.
  • This works with both LayerNorm and RMSNorm for permutations. More general basis changes (arbitrary orthogonal or invertible matrices) do not preserve the form of normalization because the learned scale is diagonal; permutations are the safe symmetry.
  • If you use tied embeddings, the orthogonality of P (true for any permutation) ensures logits are unchanged when you set E' = E P and tie to E'.
  • Training randomness (dropout) or quantization/per‑channel calibration may make bit‑for‑bit equality hard; functionally it's the same.
  • There are other independent symmetries too (e.g., permuting attention heads; permuting and rescaling neurons in the FFN), but those are separate from this global d_model permutation.

Properties of P

  • Permutation matrix: entries in {0,1}, exactly one 1 per row/column; P^T = P^{-1}.
  • Not necessary: P = P^T. That holds only for involutions (P^2 = I), which is a special case and not required.
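
As a sanity check, here is a toy numpy verification of the symmetry above on a single RMSNorm + MLP residual block with an untied head (made-up dimensions and names, not llama.cpp code):

```python
# Verify the global d_model permutation symmetry on a toy residual block:
# RMSNorm -> up projection -> ReLU -> down projection -> residual -> LM head.
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, vocab = 8, 16, 10

x      = rng.standard_normal((1, d))        # token state, row-vector convention
gamma  = rng.standard_normal(d)             # RMSNorm scale
W_up   = rng.standard_normal((d, d_ff))     # hidden -> intermediate
W_down = rng.standard_normal((d_ff, d))     # intermediate -> hidden
W_out  = rng.standard_normal((d, vocab))    # hidden -> logits (untied head)

def rmsnorm(x, g):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6) * g

def block_logits(x, g, W_up, W_down, W_out):
    h = x + np.maximum(rmsnorm(x, g) @ W_up, 0.0) @ W_down   # residual MLP (ReLU for simplicity)
    return h @ W_out

P = np.eye(d)[rng.permutation(d)]           # random permutation matrix, P^T = P^-1

# Apply the rules above: x' = x P, gamma' = gamma P, W_up' = P^T W_up,
# W_down' = W_down P, W_out' = P^T W_out.
logits     = block_logits(x, gamma, W_up, W_down, W_out)
logits_prm = block_logits(x @ P, gamma @ P, P.T @ W_up, W_down @ P, P.T @ W_out)

print(np.allclose(logits, logits_prm))      # True, up to floating-point noise
```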

Again sorry for the LLM-written output - I'm on my phone and on holiday so can't easily write all that out!

For our specific case the order within each block (and then the order within each sub-block for the K-quants) won't matter, but there will still be an astronomically large number of block permutations and it will need some greedy algorithm to work...

I'm still unsure what the best metric to optimise would be, too...

jukofyork (Collaborator) commented on Aug 23, 2025

I also forgot to ask Claude to add the bit about the intermediate dimension of the MLP blocks allowing for a separate permutation matrix for each.

Edit: Actually he decided to add this himself lol:

There are other independent symmetries too (e.g., permuting attention heads; permuting and rescaling neurons in the FFN), but those are separate from this global d_model permutation.

I didn't think of the attention heads, but if they are smaller than the block size (which most are), then these can also be permuted.

EAddario (Contributor, Author)

@CISC / @compilade, polite nudge to check if OK to merge?

const std::string in_sum2_suffix{ ".in_sum2" };
const std::string counts_suffix{ ".counts" };

// Could re-use m_stats instead, but this allows
// checking for completeness of *each* loaded imatrix file
// and also makes it easier to re-use a similar implementation in quantize.cpp
// Using an ordered map to get a deterministic iteration order.
std::map<std::string, std::pair<struct ggml_tensor *, struct ggml_tensor *>> sums_counts_for;
std::map<std::string, std::tuple<struct ggml_tensor *, struct ggml_tensor *, struct ggml_tensor *>> sums_counts_for;
Collaborator

Instead of a tuple, a small struct with struct ggml_tensor * fields might be more convenient.

Contributor (Author)

I think keeping it as a tuple simplifies the code and aids maintainability, but if this approach would be a blocker for merging, happy to change.

};

class IMatrixCollector {
public:
    IMatrixCollector() = default;
    void set_params(common_params params) { m_params = std::move(params); }
    bool activation_statistics() const { return m_params.activation_statistics; }
Collaborator

Regarding the optionality of the sums of activations, I'm not yet sure how they could be used in the quantization algorithms.

I'm not against doubling the file size if it avoids having to recalculate the imatrix because the data ended up being necessary.

The importance values used in the quantization algorithms cannot really handle ranges of input values (e.g. from the mean and std dev), because matmuls are linear.

The only use I see is informational (unless I'm missing something).

Also, currently the displayed statistics completely ignore the sums of squared activations when the non-squared ones are available (see #14891 (comment)), and so there's a reason to let both paths be possible.

EAddario (Contributor, Author) commented on Aug 24, 2025

The driver for adding statistics functionality was informational, and to support identifying and ranking which tensors/layers are most influential during inference.

Before #9400, only mean squared activations were accessible, so the stats, whilst useful, ignored the direction of change (no minus signs), making them less than ideal, but better than nothing :)

Post #9400, including the mean activations not only yields better statistics, but also opens the door to other possibilities, like being able to estimate quantisation error to programmatically choose the best quant types (PR #15550)

To keep the report output manageable, I opted to display the L2 Norm instead of the squared activations (more useful IMO).

Comment on lines +179 to 195
if (e.activations.empty()) {
    activations.reserve(e.values.size());

    for (int i = 0; i < n_mat; ++i) {
        for (int j = 0; j < row_size; ++j) {
            activations.push_back(e.values[i*row_size + j] / e.counts[i]);
        }
    }
} else {
    activations.reserve(e.activations.size());

    for (int i = 0; i < n_mat; ++i) {
        for (int j = 0; j < row_size; ++j) {
            activations.push_back(e.activations[i*row_size + j] / e.counts[i]);
        }
    }
}
compilade (Collaborator) commented on Aug 24, 2025

When the sums of activations are available, this is completely ignoring the sums of squared activations????

All of the new statistics are done over the per-channel means.

This doesn't seem right.

The sums of squared activations are used for quantization importance, and if they're completely ignored, then the statistics are possibly meaningless for importance purposes.

The mean and mean of squared activations together should allow calculating per-channel variance and stuff like that. Not sure how to turn that into per-tensor stats, though it's likely possible somehow.
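
A sketch of what that could look like, assuming per-channel sums of activations, sums of squared activations, and token counts are available as arrays (names loosely mirror the .in_sum2/.counts suffixes visible in the diff; the function itself is illustrative):

```python
# Derive per-channel variance from the two accumulators: Var[x] = E[x^2] - (E[x])^2.
import numpy as np

def per_channel_variance(in_sum: np.ndarray, in_sum2: np.ndarray, counts: float) -> np.ndarray:
    mean    = in_sum  / counts          # per-channel mean activation
    mean_sq = in_sum2 / counts          # per-channel mean of squared activations
    return np.maximum(mean_sq - mean * mean, 0.0)   # clamp small negatives from floating-point noise
```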

Contributor (Author)

That's correct, and by design. When available, using mean activations instead yields better statistics since the direction of the change (minus sign) is now available. The ECS stat (dot product of cossim and L2 norm), for example, correctly identifies attn_output and ffn_down as the most sensitive to quantisation. This is not possible with the mean of squared activations.

The idea of deriving the per-channel variance through the mean and mean of squared activations is quite interesting. I'll look into it for a future release.
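
A purely illustrative reading of "dot product of cossim and L2 norm" (the exact ECS definition used by --show-statistics in this PR may differ; a_prev and a_curr are hypothetical per-layer mean activation vectors):

```python
# Combine cosine similarity and L2 norm of the activation difference into one score,
# in the spirit of the ECS stat mentioned above; not necessarily the PR's exact formula.
import numpy as np

def ecs_like_score(a_prev: np.ndarray, a_curr: np.ndarray) -> float:
    cossim = float(a_prev @ a_curr / (np.linalg.norm(a_prev) * np.linalg.norm(a_curr) + 1e-12))
    l2 = float(np.linalg.norm(a_curr - a_prev))
    return cossim * l2
```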

EAddario (Contributor, Author)

@CISC / @compilade, ready for round 3 :)

I left some comments explaining the rationale behind the design decisions. Let me know if that's acceptable.
