feat: Matryoshka auxiliary loss #643

Merged
chanind merged 8 commits into main from matryoshka-aux-loss on Feb 13, 2026

Conversation


@chanind chanind commented Feb 10, 2026

This PR adds an optional new auxiliary loss for Matryoshka SAEs, where the auxiliary loss tries to revive dead features into the residual of the level they are at, rather than into the residual of the SAE as a whole.


claude bot commented Feb 10, 2026

PR Review: Matryoshka Auxiliary Loss

This PR adds an optional auxiliary loss for Matryoshka SAEs that encourages dead features to reconstruct the residual of their own level rather than the entire SAE. The implementation is well-structured with comprehensive test coverage. Here's my feedback:

Code Quality & Best Practices

Positive aspects:

  • Clean separation of concerns with the new _iterable_decode generator method
  • Good reuse of existing calculate_topk_aux_acts function (now properly exported from topk_sae.py)
  • Proper use of @override decorator
  • Configuration is backward-compatible with use_matryoshka_aux_loss defaulting to False
  • Documentation is clear about the tradeoff (better revival vs slower training)

Issues to address:

  1. Potential gradient flow issue (matryoshka_batchtopk_sae.py:175): The residual calculation uses .detach():

    residual = (sae_in - partial_sae_out).detach()

    This prevents gradients from flowing back through partial_sae_out to earlier levels. While this matches the base class behavior in topk_sae.py:458, it means the auxiliary loss won't push gradients back into the reconstructions produced by earlier levels. This may be intentional, but worth documenting why.

  2. Inefficient recomputation (matryoshka_batchtopk_sae.py:161): When use_matryoshka_aux_loss=True, the method recomputes acts = self.activation_fn(hidden_pre) even though these activations were already computed in the forward pass. Consider passing them as a parameter to avoid redundant computation.

  3. Memory inefficiency in aux loss (matryoshka_batchtopk_sae.py:187): The reconstruction adds the full b_dec bias for each level:

    recons = auxk_acts @ self.W_dec[prev_width:width] + self.b_dec

    However, b_dec is already included in partial_sae_out (since it's computed via _iterable_decode). This means the bias is being added multiple times unnecessarily. This should likely just be:

    recons = auxk_acts @ self.W_dec[prev_width:width]

    without the bias term, since we're computing the residual against partial_sae_out which already includes it (a small standalone check of this appears after this list).

  4. Type annotation cleanup (tests/helpers.py:578): The change from TrainingSAEConfigDict to dict[str, Any] is correct (since matryoshka config extends it), but the comment and variable names could be clearer about why this changed.
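
A small standalone check of point 3 above, using stand-in tensors rather than the actual sae_lens code: it only illustrates that once partial_sae_out carries b_dec, an aux reconstruction that matches the residual needs no additional bias term.

import torch

torch.manual_seed(0)
sae_in = torch.randn(4, 8)
b_dec = torch.randn(8)
level_out = torch.randn(4, 8)            # stand-in for this level's decode, bias excluded
partial_sae_out = level_out + b_dec      # the decode path adds b_dec once

residual = sae_in - partial_sae_out
perfect_recons = residual                # an ideal aux prediction; no extra bias needed
doubly_biased = perfect_recons + b_dec   # what adding b_dec again would produce

assert torch.allclose(partial_sae_out + perfect_recons, sae_in, atol=1e-6)
assert not torch.allclose(partial_sae_out + doubly_biased, sae_in)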

Test Coverage

Excellent test coverage:

  • Manual computation verification ✓
  • Tests that dead features target their level's residual (not full SAE residual) ✓
  • Gradient flow validation ✓
  • Flag toggle behavior ✓
  • Edge cases (no dead features, selective dead features, skipped levels) ✓

Test quality issues:

  1. test_matryoshka_aux_loss_manual_computation (line 421): This test is excellent for verifying correctness, but if the bias addition issue (point 3 above) is fixed, this test will need updating to match.

  2. test_matryoshka_aux_loss_gradients_flow_through_own_level_weights (line 149): The test verifies that gradients only flow to the appropriate level, which is good. However, it uses pytest.approx(0.0, abs=1e-7) which is overly strict and could cause flakiness. Consider using a slightly looser tolerance like abs=1e-6.

  3. All tests use rescale_acts_by_decoder_norm=False: Consider adding at least one test with rescale_acts_by_decoder_norm=True to ensure the new code path handles the normalization correctly in _iterable_decode.

Performance Considerations

  1. Training overhead: The _iterable_decode method is called multiple times per training step when both training_forward_pass and calculate_topk_aux_loss are invoked with use_matryoshka_aux_loss=True. The forward pass iterates once (line 110-120), and aux loss iterates again (line 164-190). Consider caching partial reconstructions if this becomes a bottleneck.

  2. Sparse tensor handling: Good use of sparse matmul when appropriate (line 96-100).

Security & Correctness

Critical issue:

The bias term addition in the aux loss reconstruction (point 3 above) appears incorrect and will lead to wrong gradient signals for dead feature revival. This needs verification and likely fixing.

Minor issues:

  1. Edge case: If matryoshka_widths has only one level (which triggers a warning), calculate_topk_aux_loss with use_matryoshka_aux_loss=True will still iterate over all levels. This is correct but could have a comment explaining the behavior.

Recommendations

Must fix:

  • Verify and fix the bias term addition in aux loss (item 3 in Code Quality section)

Should fix:

  • Pass acts to avoid recomputation in aux loss
  • Document why residual uses .detach()
  • Add test with rescale_acts_by_decoder_norm=True

Nice to have:

  • Consider caching partial reconstructions if performance becomes an issue
  • Loosen the gradient test tolerance slightly

Summary

This is a solid implementation with thorough testing. The main concern is the potential bias term error in the auxiliary loss reconstruction, which could affect the correctness of dead feature revival. Once that's verified/fixed, this should be ready to merge.


codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 96.29630% with 2 lines in your changes missing coverage. Please review.

Files with missing lines: sae_lens/saes/matryoshka_batchtopk_sae.py (patch 96.00%; 1 missing, 1 partial) ⚠️


Copilot AI review requested due to automatic review settings February 12, 2026 16:26
@chanind chanind force-pushed the matryoshka-aux-loss branch from 4e3eeaa to 424e190 on February 12, 2026 16:30
Copilot AI (Contributor) left a comment

Pull request overview

Adds an optional Matryoshka-specific auxiliary loss for BatchTopK SAEs, where dead features are trained to reconstruct the level-local residual rather than the full SAE residual.

Changes:

  • Introduces use_matryoshka_aux_loss config flag and Matryoshka-specific aux-loss implementation.
  • Exposes calculate_topk_aux_acts as a public helper and updates call sites.
  • Adds unit tests covering manual computation, residual targeting behavior, gradient flow, and flag-off parity.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

  • sae_lens/saes/matryoshka_batchtopk_sae.py: Adds iterable level decoding + new Matryoshka aux loss behind a config flag.
  • sae_lens/saes/topk_sae.py: Renames/exports aux-acts helper to reuse from Matryoshka implementation.
  • tests/saes/test_matryoshka_batchtopk_sae.py: Adds tests validating new aux-loss behavior + gradients + parity.
  • tests/helpers.py: Updates helper config typing and adds default for new flag.


)

# Reduce the scale of the loss if there are a small number of dead latents
scale = min(partial_num_dead / partial_k_aux, 1.0)
Copilot AI commented on Feb 12, 2026

The scaling factor is effectively always 1.0 because partial_k_aux = min(k_aux, partial_num_dead) makes partial_num_dead / partial_k_aux >= 1. In calculate_topk_aux_loss the scale is computed before clamping k_aux to num_dead. To match that intended behavior (and your tests' stated intent), compute scale using the unclamped k_aux (or compute scale before setting partial_k_aux).

Suggested change:
- scale = min(partial_num_dead / partial_k_aux, 1.0)
+ scale = min(partial_num_dead / k_aux, 1.0)

Comment on lines 208 to 209
# top k living latents
recons = auxk_acts @ self.W_dec[prev_width:width]
Copilot AI commented on Feb 12, 2026

The aux-loss reconstruction path is inconsistent with the Matryoshka decode path used to compute partial_sae_out (via _iterable_decode). _iterable_decode may rescale activations by decoder norm and applies run_time_activation_norm_fn_out (and starts from b_dec), but recons here does not. This makes the target residual and the predicted reconstruction live in different 'decode conventions', and also ignores rescale_acts_by_decoder_norm for the aux path. Consider constructing recons using the same conventions as _iterable_decode for a single level (including any act rescaling and output activation-norm steps, and whether b_dec should be included) so that (recons - residual) is meaningful and consistent across configs.

Suggested change:
- # top k living latents
- recons = auxk_acts @ self.W_dec[prev_width:width]
+ # top k living latents. Use the same decode conventions as the main
+ # Matryoshka path (activation rescaling, bias, and output norm).
+ scaled_auxk_acts = auxk_acts
+ if getattr(self.cfg, "rescale_acts_by_decoder_norm", False) and hasattr(
+     self, "decoder_norms"
+ ):
+     scaled_auxk_acts = scaled_auxk_acts * self.decoder_norms[prev_width:width]
+ recons = scaled_auxk_acts @ self.W_dec[prev_width:width]
+ if hasattr(self, "b_dec") and self.b_dec is not None:
+     recons = recons + self.b_dec
+ if hasattr(self, "run_time_activation_norm_fn_out") and (
+     self.run_time_activation_norm_fn_out is not None
+ ):
+     recons = self.run_time_activation_norm_fn_out(recons)

Comment on lines 85 to 86
inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
if self.cfg.rescale_acts_by_decoder_norm:
Copilot AI commented on Feb 12, 2026

inv_W_dec_norm is computed unconditionally even when rescale_acts_by_decoder_norm is False. This adds an extra norm+division on every call, and can produce inf when rows of W_dec are zero (your tests set W_dec[0:3] = 0.0). Compute inv_W_dec_norm only inside the if, and consider clamping the norm with an epsilon to avoid inf when rescaling is enabled.

Suggested change:
- inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
- if self.cfg.rescale_acts_by_decoder_norm:
+ if self.cfg.rescale_acts_by_decoder_norm:
+     # Clamp norm to avoid division by zero when some rows of W_dec are zero.
+     W_dec_norm = self.W_dec.norm(dim=-1)
+     inv_W_dec_norm = 1.0 / torch.clamp(W_dec_norm, min=1e-12)

Comment on lines +521 to 524
def calculate_topk_aux_acts(
k_aux: int,
hidden_pre: torch.Tensor,
dead_neuron_mask: torch.Tensor,
Copilot AI commented on Feb 12, 2026

Renaming _calculate_topk_aux_acts to calculate_topk_aux_acts is a breaking change for any internal/external callers that still import the underscored name. To preserve compatibility, consider keeping _calculate_topk_aux_acts as a thin alias to calculate_topk_aux_acts (optionally with a deprecation warning) for at least one release cycle.

partial_dead_mask_1 = dead_neuron_mask[0:3]
num_dead_1 = int(partial_dead_mask_1.sum()) # = 1
partial_k_aux_1 = min(k_aux, num_dead_1) # = 1
scale_1 = min(num_dead_1 / partial_k_aux_1, 1.0) # = 1.0
Copilot AI commented on Feb 12, 2026

This test claims it exercises scale < 1.0 (comment says 0.5), but the computed scale_1 is 1.0 due to using partial_k_aux_1 (already clamped). If the intended behavior is the same as calculate_topk_aux_loss (scale computed vs the original k_aux), then the test should compute scale_1 against k_aux (yielding 0.5 here) so it actually validates the scaling behavior.

Suggested change:
- scale_1 = min(num_dead_1 / partial_k_aux_1, 1.0)  # = 1.0
+ scale_1 = min(num_dead_1 / k_aux, 1.0)  # = 0.5

k_aux = d_in // 2 # = 2

# Level 1: features 0:3, 1 dead feature
partial_sae_out_1 = acts[:, 0:3] @ sae.W_dec.data[0:3] + sae.b_dec.data
Copilot AI commented on Feb 12, 2026

Using .data in tests bypasses autograd and can mask subtle issues if these tensors ever end up participating in graph construction. Prefer .detach() (or just use the parameter tensors directly when gradients are irrelevant) to make the intent explicit without relying on .data.

auxk_acts_1 = calculate_topk_aux_acts(
partial_k_aux_1, partial_hidden_pre_1, partial_dead_mask_1
)
recons_1 = auxk_acts_1 @ sae.W_dec.data[0:3]
Copilot AI commented on Feb 12, 2026

Using .data in tests bypasses autograd and can mask subtle issues if these tensors ever end up participating in graph construction. Prefer .detach() (or just use the parameter tensors directly when gradients are irrelevant) to make the intent explicit without relying on .data.

Comment on lines +471 to +473
acts[:, 0:3] @ sae.W_dec.data[0:3]
+ acts[:, 3:6] @ sae.W_dec.data[3:6]
+ sae.b_dec.data
Copilot AI commented on Feb 12, 2026

Using .data in tests bypasses autograd and can mask subtle issues if these tensors ever end up participating in graph construction. Prefer .detach() (or just use the parameter tensors directly when gradients are irrelevant) to make the intent explicit without relying on .data.

auxk_acts_2 = calculate_topk_aux_acts(
partial_k_aux_2, partial_hidden_pre_2, partial_dead_mask_2
)
recons_2 = auxk_acts_2 @ sae.W_dec.data[3:6]
Copilot AI commented on Feb 12, 2026

Using .data in tests bypasses autograd and can mask subtle issues if these tensors ever end up participating in graph construction. Prefer .detach() (or just use the parameter tensors directly when gradients are irrelevant) to make the intent explicit without relying on .data.


claude bot commented Feb 12, 2026

PR Review: Matryoshka Auxiliary Loss

This PR adds an optional auxiliary loss mechanism for Matryoshka SAEs where dead features are encouraged to reconstruct the residual of their own level rather than the entire SAE. Overall, the implementation looks solid with comprehensive tests. Here are my findings:

Strengths

  1. Excellent test coverage: The test suite is comprehensive and tests the actual behavior rather than superficial properties, following the repository guidelines perfectly.

  2. Well-structured tests: Tests verify core functionality including:

    • Manual computation verification
    • Level-specific residual usage (not full residual)
    • Gradient flow correctness
    • Proper handling of dead features at different levels
    • Compatibility with rescale_acts_by_decoder_norm
  3. Good documentation: The config parameter is well-documented explaining the tradeoff (better feature revival vs slower training).

  4. Backward compatibility: The feature is off by default and doesn't break existing behavior.

Code Quality Issues

1. Potential Performance Issue in calculate_matryoshka_aux_loss (matryoshka_batchtopk_sae.py:197-199)

The TODO comment is accurate - this recalculates SAE output for each level:

# TODO: find a way to implement this without needing to recalculate the SAE output for each level
# may need to wait for a refactor in the next release of sae_lens for a clean way to do this
for width, partial_sae_out in self._iterable_decode(
    feature_acts, include_outer_loss=True
):

However, _iterable_decode is already accumulative (line 107: decoded = inner_feature_acts @ self.W_dec[prev_width:width] + decoded), so the partial outputs are already being computed incrementally. The issue is that you're calling _iterable_decode twice: once in training_forward_pass (line 114) and again in calculate_matryoshka_aux_loss (line 197).

Suggestion: Consider caching the partial reconstructions from training_forward_pass to avoid redundant computation in the aux loss, or refactor to compute them once and pass them through.
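
As an illustrative sketch of this caching idea (stand-in tensors only, not the actual sae_lens class internals): the accumulative decode described above can collect the per-level partial reconstructions into a list once, so both the main output and the per-level aux loss can read from that list instead of iterating the decode twice.

import torch

def partial_reconstructions(
    feature_acts: torch.Tensor,  # (batch, d_sae)
    W_dec: torch.Tensor,         # (d_sae, d_in)
    b_dec: torch.Tensor,         # (d_in,)
    widths: list[int],           # cumulative matryoshka level widths, e.g. [3, 6]
) -> list[torch.Tensor]:
    decoded = b_dec.expand(feature_acts.shape[0], -1)
    outs: list[torch.Tensor] = []
    prev = 0
    for width in widths:
        # Same accumulative update as the decode loop quoted above.
        decoded = feature_acts[:, prev:width] @ W_dec[prev:width] + decoded
        outs.append(decoded)
        prev = width
    return outs  # outs[-1] is the full reconstruction; earlier entries feed the aux loss

partials = partial_reconstructions(
    torch.randn(4, 6), torch.randn(6, 8), torch.randn(8), widths=[3, 6]
)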

2. Inconsistent Scaling Logic Between Base and Matryoshka Aux Loss

In calculate_matryoshka_aux_loss (line 216):

scale = min(partial_num_dead / partial_k_aux, 1.0)

But the base class calculate_topk_aux_loss (topk_sae.py:488) does:

scale = min(num_dead / k_aux, 1.0)
k_aux = min(k_aux, num_dead)

The order is different. In the base class, k_aux is clamped AFTER computing scale. In the matryoshka version (line 206):

partial_k_aux = min(k_aux, partial_num_dead)

This happens BEFORE computing scale (line 216). This means when partial_num_dead < k_aux, the matryoshka version will always have scale = 1.0, while the base version would have scale < 1.0.

This is a bug. The scale should be computed before clamping k_aux. The test at line 444-448 tries to exercise scale < 1.0 but won't catch this because it sets up the scenario incorrectly (only 1 dead feature, but expects scale = 0.5 in the comment).

Fix: Swap lines 206 and 216:

scale = min(partial_num_dead / k_aux, 1.0)
partial_k_aux = min(k_aux, partial_num_dead)
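
A quick plain-Python check of the ordering issue (no sae_lens code involved), using the numbers from the test scenario mentioned above (k_aux = 2, one dead feature at the first level):

k_aux = 2
partial_num_dead = 1

# Reviewed (matryoshka) order: clamp first, so the ratio is always >= 1.
partial_k_aux = min(k_aux, partial_num_dead)                    # 1
scale_after_clamp = min(partial_num_dead / partial_k_aux, 1.0)  # 1.0

# Base-class order: scale against the original k_aux, then clamp.
scale_before_clamp = min(partial_num_dead / k_aux, 1.0)         # 0.5

assert scale_after_clamp == 1.0
assert scale_before_clamp == 0.5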

3. Test Assertion May Be Too Loose (test_matryoshka_batchtopk_sae.py:559)

assert matryoshka_loss.item() != pytest.approx(base_loss.item(), rel=0.1)

This test uses != with pytest.approx, which is checking if values are NOT approximately equal. The intent is good, but the implementation is checking if the difference is > 10%. This could pass even if the difference is small (e.g., 11%).

Suggestion: Be more explicit about the expected relationship:

assert matryoshka_loss.item() > base_loss.item() * 1.5  # or some meaningful threshold

Or add a comment explaining why 10% relative difference is the right threshold.

4. Missing Edge Case: What if aux_losses is Empty? (matryoshka_batchtopk_sae.py:228)

If all levels have 0 dead features (every iteration hits line 203 continue), then aux_losses will be empty and line 228 will fail:

stacked_losses = torch.stack(aux_losses)

This would raise RuntimeError: stack expects a non-empty TensorList. The early return at line 184 only handles the case where dead_neuron_mask is None or has sum 0, but not the case where dead features exist but are all in ranges that get skipped.

Fix: Check if aux_losses is empty before stacking:

if not aux_losses:
    return sae_out.new_tensor(0.0)
stacked_losses = torch.stack(aux_losses)

Minor Issues

  1. Unused import in tests (test_matryoshka_batchtopk_sae.py:1)
import copy

Only used once. This is fine, but worth noting.

  2. Type annotation improvement in helpers.py (line 578)
    Good fix changing from TrainingSAEConfigDict to dict[str, Any], but this appears to be a cleanup rather than a functional change related to the feature.

Testing Suggestions

  1. Missing test case: Test with normalize_activations enabled. The base calculate_topk_aux_loss raises an error if normalization is used (topk_sae.py:474-480). Does the matryoshka version handle this correctly? Should add a test.

  2. Statistical rigor: The gradient tests check for exact zero (line 578, 582, 585, 587) using pytest.approx(0.0, abs=1e-7). Consider if this is tight enough or if numerical precision could cause spurious failures. Given the CLAUDE.md guidance about tight bounds, this seems reasonable.

Summary

This is a well-thought-out feature with excellent tests. The main issues are:

  1. Critical: The scaling logic bug (item 2 above) - the scale is computed after clamping k_aux instead of before
  2. Important: Missing empty list check (item 4 above) - could cause runtime error
  3. Performance: Redundant computation noted in TODO (item 1 above) - acknowledged but worth addressing

The tests are exemplary in following the repository guidelines - they test actual correctness rather than superficial properties. Great work overall!

Recommendation: Fix items 2 and 4 before merging. Item 1 can be addressed in a future PR as noted in the TODO.

@decoderesearch decoderesearch deleted a comment from claude bot Feb 12, 2026

claude bot commented Feb 12, 2026

Code Review: Matryoshka Auxiliary Loss

Thanks for this PR! This adds an interesting new capability for Matryoshka SAEs. I've reviewed the changes and have the following feedback:

Strengths

  1. Well-tested: The test suite is comprehensive and tests the core logic thoroughly. Tests verify:

    • Manual computation matches the implementation
    • Level-specific residuals are used (not just full SAE residual)
    • Gradients flow through the correct level weights
    • The feature correctly falls back to base class behavior when disabled
    • Statistical properties with level skipping
  2. Clean implementation: The _iterable_decode generator pattern is elegant and makes the code reusable between training_forward_pass and calculate_matryoshka_aux_loss.

  3. Good documentation: The docstrings clearly explain the feature and tradeoffs.

  4. Backward compatibility: The use_matryoshka_aux_loss flag defaults to False, ensuring existing code continues to work.

Issues to Address

1. Performance concern with _iterable_decode (matryoshka_batchtopk_sae.py:86-108)

The _iterable_decode method modifies feature_acts in place when rescale_acts_by_decoder_norm=True:

if self.cfg.rescale_acts_by_decoder_norm:
    inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
    feature_acts = feature_acts * inv_W_dec_norm  # <-- modifies input tensor

This is called from both training_forward_pass and calculate_matryoshka_aux_loss, both of which pass base_output.feature_acts. This means:

  • In training_forward_pass, the original feature_acts gets rescaled
  • Then in calculate_aux_loss, those already-rescaled activations get rescaled again

Fix: Either clone the tensor before modifying it, or use a local variable:

def _iterable_decode(
    self, feature_acts: torch.Tensor, include_outer_loss: bool = False
) -> Generator[tuple[int, torch.Tensor], None, None]:
    if self.cfg.rescale_acts_by_decoder_norm:
        inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
        feature_acts = feature_acts * inv_W_dec_norm  # Creates new tensor due to broadcasting
    # ... rest of method

Actually, looking more carefully, the multiplication with broadcasting creates a new tensor, so this might be fine. But to be safe and clear, consider explicitly cloning or documenting this behavior.
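
A tiny standalone check of that claim (pure PyTorch, unrelated to the sae_lens code): out-of-place multiplication with broadcasting returns a new tensor and leaves the input untouched, whereas an in-place op would mutate the caller's tensor.

import torch

feature_acts = torch.randn(4, 8)
inv_W_dec_norm = torch.rand(8)

rescaled = feature_acts * inv_W_dec_norm   # out-of-place: new tensor, new storage
assert rescaled.data_ptr() != feature_acts.data_ptr()

mutated = feature_acts.clone()
mutated *= inv_W_dec_norm                  # in-place: this form would mutate the original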

2. Potential gradient issue in calculate_matryoshka_aux_loss (matryoshka_batchtopk_sae.py:210)

The residual computation uses .detach():

residual = (sae_in - partial_sae_out).detach()

This is correct for the standard aux loss (matching the base implementation). However, I notice that partial_sae_out is computed from _iterable_decode, which itself uses feature_acts that still has gradients. Since we .detach() the residual, gradients won't flow back through the reconstructed output to update the living features' weights—only through the aux reconstruction term. This seems intentional and matches the base class, so this is fine.

3. Empty losses list could cause error (matryoshka_batchtopk_sae.py:227)

If no levels have any dead features, aux_losses will be empty, and torch.stack(aux_losses) will fail:

aux_losses = []
for width, partial_sae_out in self._iterable_decode(...):
    # ...
    if partial_num_dead == 0:
        prev_width = width
        continue
    # ... append to aux_losses
stacked_losses = torch.stack(aux_losses)  # <-- fails if aux_losses is empty

Fix: Add a check before stacking:

if len(aux_losses) == 0:
    return sae_out.new_tensor(0.0)
stacked_losses = torch.stack(aux_losses)
return self.cfg.aux_loss_coefficient * stacked_losses.sum()

4. Type annotation inconsistency (tests/helpers.py:578)

The change from TrainingSAEConfigDict to dict[str, Any] is correct, but it's inconsistent with the pattern used in other builder functions. Looking at the code, this seems fine since you're adding a field (use_matryoshka_aux_loss) that's not in TrainingSAEConfigDict. This is actually a fix, not an issue.

Minor Suggestions

  1. Comment clarity (matryoshka_batchtopk_sae.py:189-194): The comment about rescaling could be more precise. Currently it says "just rescale the decoder weights once", but what it really does is normalize them once. Consider:

    # Normalize decoder weights once to avoid repeated computation across levels
  2. TODO comment (matryoshka_batchtopk_sae.py:195-196): The TODO mentions waiting for a refactor. Is this causing a significant performance issue? If so, could you quantify it in the comment? If not, consider if the TODO is necessary.

  3. Test naming: The test names are excellent and self-documenting. Good job following the CLAUDE.md guidelines.

  4. Unused method (matryoshka_batchtopk_sae.py:126-147): The _decode_matryoshka_level method appears to be unused after refactoring to use _iterable_decode. Consider removing it.

Performance Considerations

The new auxiliary loss requires calling _iterable_decode which recomputes the SAE output for each level. As noted in the TODO comment, this could be optimized by caching intermediate results. However, this only happens when use_matryoshka_aux_loss=True, which defaults to False, so it's an opt-in performance cost that's well-documented.

Test Coverage

The test coverage is excellent. The tests focus on verifying correct behavior rather than superficial checks, which aligns perfectly with the CLAUDE.md guidelines. In particular:

  • test_matryoshka_aux_loss_manual_computation: Validates the math by hand
  • test_matryoshka_aux_loss_uses_level_residual_not_full_residual: Tests the key behavioral difference
  • test_matryoshka_aux_loss_gradients_flow_through_own_level_weights: Verifies gradient flow correctness

Summary

This is a solid implementation with comprehensive tests. The main issue to fix is the potential empty list in torch.stack(). Consider also reviewing whether _decode_matryoshka_level should be removed and clarifying the tensor mutation behavior in _iterable_decode.

The feature design makes sense—encouraging dead features to revive at their own level rather than the full SAE output is a sensible architectural choice for Matryoshka SAEs.


claude bot commented Feb 12, 2026

Pull Request Review: Matryoshka Auxiliary Loss

Thank you for this contribution! I've reviewed the PR and have the following feedback:

Strengths

  1. Well-tested feature: The test coverage is excellent with 8 comprehensive tests covering edge cases, gradient flow, manual computation verification, and comparison with base class behavior.

  2. Clear documentation: The docstring additions explain the new use_matryoshka_aux_loss parameter well.

  3. Good code organization: The _iterable_decode helper method is a clean abstraction that's reused effectively.

  4. Thoughtful design: The per-level auxiliary loss approach is conceptually sound - encouraging dead features to reconstruct their own level's residual should indeed improve feature revival.

Issues & Suggestions

Critical Issues

  1. Potential bug in _iterable_decode with rescale_acts_by_decoder_norm (matryoshka_batchtopk_sae.py:89-92)

    The method modifies feature_acts in-place when rescaling:

    if self.cfg.rescale_acts_by_decoder_norm:
        inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
        feature_acts = feature_acts * inv_W_dec_norm

    This is problematic because:

    • If feature_acts is reused by the caller after this method returns, it will have been mutated
    • This breaks functional programming principles and can lead to subtle bugs

    Fix: Clone before modifying:

    if self.cfg.rescale_acts_by_decoder_norm:
        inv_W_dec_norm = 1 / self.W_dec.norm(dim=-1)
        feature_acts = feature_acts * inv_W_dec_norm  # This creates a new tensor, but should be explicit

    Actually, multiplication creates a new tensor, so this may not be a bug. But to be safe and explicit, consider feature_acts = feature_acts.clone() * inv_W_dec_norm if you want to be certain.

  2. Empty aux_losses list causes error (matryoshka_batchtopk_sae.py:204)

    In calculate_matryoshka_aux_loss, if no level has dead features, aux_losses will be empty and torch.stack(aux_losses) will fail.

    Current code:

    if dead_neuron_mask is not None and int(dead_neuron_mask.sum()) > 0:
        # ... loop that may skip all levels if partial_num_dead == 0 for each
        stacked_losses = torch.stack(aux_losses)  # FAILS if aux_losses is empty
        return self.cfg.aux_loss_coefficient * stacked_losses.sum()
    return sae_out.new_tensor(0.0)

    Fix: Check if aux_losses is empty before stacking:

    if aux_losses:
        stacked_losses = torch.stack(aux_losses)
        return self.cfg.aux_loss_coefficient * stacked_losses.sum()
    return sae_out.new_tensor(0.0)

Performance Concerns

  1. Redundant computation in calculate_matryoshka_aux_loss (matryoshka_batchtopk_sae.py:174-176)

    The TODO comment on line 172-173 mentions this, but it's worth emphasizing: calling _iterable_decode with include_outer_loss=True recalculates all SAE outputs for each level, which is expensive. The partial SAE outputs are already computed in training_forward_pass.

    Suggestion: Consider caching these intermediate reconstructions during the forward pass, or restructure to avoid recomputation. This is a significant performance concern for training speed.

  2. Repeated norm computation (matryoshka_batchtopk_sae.py:91, 167-170)

    W_dec.norm(dim=-1) is computed in both _iterable_decode and calculate_matryoshka_aux_loss. If both are called in the same forward pass (which they are), this is wasteful.

    Suggestion: Compute once and reuse, or cache as a buffer if this norm doesn't change during training.
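
A minimal standalone sketch of "compute once and reuse", with stand-in tensors rather than the real module (whether caching as a buffer is safe depends on whether W_dec changes during training, as noted above):

import torch

W_dec = torch.randn(16, 8)
acts = torch.randn(4, 16)

# Compute the inverse decoder norms once per step...
inv_W_dec_norm = 1.0 / W_dec.norm(dim=-1).clamp_min(1e-12)

# ...then reuse the same tensor at both call sites instead of recomputing W_dec.norm().
rescaled_for_decode = acts * inv_W_dec_norm
rescaled_for_aux = acts * inv_W_dec_norm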

Code Quality

  1. Inconsistent return types in _iterable_decode (matryoshka_batchtopk_sae.py:88)

    The method yields (int, torch.Tensor) tuples, but the width is the cumulative width, not the level index. This is confusing. Consider yielding (level_idx, width, reconstruction) or making the semantics clearer in documentation.

  2. Type annotation could be more specific (matryoshka_batchtopk_sae.py:158)

    dead_neuron_mask: torch.Tensor | None - but the logic assumes it's a boolean tensor if not None. Consider using a more specific type annotation or adding a runtime check.

Test Quality

  1. Great tests overall, but one suggestion:

    In test_matryoshka_aux_loss_gradients_flow_through_own_level_weights (test_matryoshka_batchtopk_sae.py:142-191), the test uses pytest.approx(0.0, abs=1e-7) to verify zero gradients. This is correct per the guidelines. Good job following the conventions!

  2. Test test_matryoshka_aux_loss_manual_computation (test_matryoshka_batchtopk_sae.py:419-494)

    This is excellent - it manually computes the expected loss step-by-step and compares. This is exactly the kind of test that would catch bugs in the core logic.

Documentation

  1. Missing edge case documentation

    The docstring for use_matryoshka_aux_loss (matryoshka_batchtopk_sae.py:40-42) mentions it's "slower to train" but doesn't quantify or explain why. Adding "This requires computing partial reconstructions for each level during aux loss calculation" would help users understand the tradeoff.

Minor Issues

  1. Import organization (matryoshka_batchtopk_sae.py:13-17)

    The imports from topk_sae are now multiple. Consider whether these functions should be in a shared utilities module if they're being reused across SAE variants.

  2. Magic number (matryoshka_batchtopk_sae.py:162)

    k_aux = sae_in.shape[-1] // 2 - The division by 2 is unexplained. This matches the base class behavior, but a comment explaining why k_aux is half of d_in would improve readability.

Security/Safety

No security concerns identified. This is purely computational code with no external I/O or user input.

Summary

Recommendation: Request changes to fix the critical issue with empty aux_losses list (point 2). The performance concern (point 3) is acknowledged in the TODO but is significant for production use.

The test coverage is excellent and the feature is well-designed. Once the empty list bug is fixed, this would be ready to merge.

Severity Assessment:

  • Critical (must fix): Issue 2 (empty aux_losses)
  • High (should fix): Issue 3 (performance - but has TODO acknowledging it)
  • Medium (nice to have): Issues 1, 4, 9, 11
  • Low (optional): Issues 5, 6, 10

Great work overall! The test suite gives high confidence in the correctness of the implementation.

@chanind chanind merged commit f5c5110 into main Feb 13, 2026
5 checks passed
@chanind chanind deleted the matryoshka-aux-loss branch February 13, 2026 13:24