feat: Add support for acausal crosscoders. by mkbehr · Pull Request #475 · decoderesearch/SAELens

mkbehr · 2025-05-07T02:21:05Z

Description

Add support for acausal crosscoders, as described in https://transformer-circuits.pub/2024/crosscoders/index.html.

Add a new CrosscoderSAE class and related training classes. Crosscoder SAEs take activations from multiple hook points and predict activations at multiple hook points, with an input and output shape of (..., n_layers, d_in).
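To make the shapes concrete, here is a dependency-free sketch of an acausal crosscoder forward pass and loss for a single example. It follows the formulation in the linked Transformer Circuits post: one shared latent vector is encoded from all layers and decoded back into every layer, and the sparsity penalty weights each latent's activation by the sum of its per-layer decoder norms. It uses plain Python lists instead of PyTorch, and the function names are illustrative assumptions, not the CrosscoderSAE API.

```python
import math
import random


def crosscoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Hypothetical acausal crosscoder forward pass for one example.

    x:     n_layers x d_in       W_enc: n_layers x d_in x d_sae
    b_enc: d_sae                 W_dec: d_sae x n_layers x d_in
    b_dec: n_layers x d_in
    Returns (feature_acts, reconstruction), reconstruction is n_layers x d_in.
    """
    n_layers, d_in, d_sae = len(x), len(x[0]), len(b_enc)
    # Encoder: a single latent vector that sums contributions from every layer.
    feature_acts = []
    for j in range(d_sae):
        pre = b_enc[j] + sum(
            x[l][i] * W_enc[l][i][j] for l in range(n_layers) for i in range(d_in)
        )
        feature_acts.append(max(pre, 0.0))  # ReLU
    # Decoder: each latent writes its own direction into every layer.
    recon = [
        [
            b_dec[l][i] + sum(feature_acts[j] * W_dec[j][l][i] for j in range(d_sae))
            for i in range(d_in)
        ]
        for l in range(n_layers)
    ]
    return feature_acts, recon


def crosscoder_loss(x, feature_acts, recon, W_dec, l1_coefficient):
    # Reconstruction term: squared error summed over every layer.
    mse = sum(
        (x[l][i] - recon[l][i]) ** 2
        for l in range(len(x))
        for i in range(len(x[l]))
    )
    # Sparsity term: activation times the sum over layers of the L2 norm
    # of that latent's per-layer decoder vector.
    sparsity = sum(
        f * sum(math.sqrt(sum(w * w for w in layer_dir)) for layer_dir in W_dec[j])
        for j, f in enumerate(feature_acts)
    )
    return mse + l1_coefficient * sparsity


def rand_tensor(*shape):
    # Nested lists of small Gaussian values with the given shape.
    if len(shape) == 1:
        return [random.gauss(0.0, 0.1) for _ in range(shape[0])]
    return [rand_tensor(*shape[1:]) for _ in range(shape[0])]


random.seed(0)
n_layers, d_in, d_sae = 3, 4, 8
x = rand_tensor(n_layers, d_in)
W_enc = rand_tensor(n_layers, d_in, d_sae)
b_enc = [0.0] * d_sae
W_dec = rand_tensor(d_sae, n_layers, d_in)
b_dec = [[0.0] * d_in for _ in range(n_layers)]

feature_acts, recon = crosscoder_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = crosscoder_loss(x, feature_acts, recon, W_dec, l1_coefficient=1.0)
print(len(recon), len(recon[0]))  # 3 4
```

Because the decoder norm is summed across layers, a latent that writes to many layers pays a larger sparsity cost than one confined to a single layer; a real implementation would batch this over the leading (...) dimensions.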

In this PR, the input and predicted activations must come from the same set of layers. This means the crosscoders have to be acausal, and they don't support use_error_term=False or any kind of reconstruction evals. In future changes, we can support causal crosscoders by letting the crosscoder take inputs from one set of hooks and predict activations at another set. That would also support transcoders as a special case.
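One plausible way to see why same-layer inputs and outputs rule out splicing the reconstruction back into the model is an ordering check: every output comes from a shared latent that depends on all inputs, so every input hook must fire before the earliest output hook is replaced. The helper below is hypothetical, not part of SAELens, and assumes hooks are identified by the integer order in which they fire during a forward pass.

```python
def can_splice_in_one_pass(input_positions, output_positions):
    """Hypothetical check: can a crosscoder's reconstruction replace model
    activations during a single forward pass?

    Positions are the order in which hooks fire. All outputs share one latent
    computed from all inputs, so the last input must fire no later than the
    first output that gets replaced.
    """
    return max(input_positions) <= min(output_positions)


# Standard SAE: read and write the same hook.
print(can_splice_in_one_pass([5], [5]))  # True
# Acausal crosscoder over layers 0..2: the output at layer 0 needs layer 2's input.
print(can_splice_in_one_pass([0, 1, 2], [0, 1, 2]))  # False
# Transcoder-style: read an earlier hook, write a later one.
print(can_splice_in_one_pass([3], [4]))  # True
```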

The new config parameter hook_names controls whether to use a standard SAE or a crosscoder, and determines which hook points are used. Setting hook_names also makes ActivationsStore return activations from multiple layers, stacked along the existing layer dimension.
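A minimal sketch of how a hook_names field could select between the two modes and how per-hook activations would stack along the layer dimension. The dataclass and helper names are hypothetical stand-ins, not the actual SAELens config or ActivationsStore code; the hook names are standard TransformerLens names used for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HookConfigSketch:
    """Hypothetical stand-in for the SAE config; only the hook fields are modeled."""

    hook_name: str
    hook_names: Optional[list[str]] = None

    @property
    def is_crosscoder(self) -> bool:
        # Assumption: setting hook_names selects a crosscoder, leaving it unset
        # keeps a standard single-hook SAE.
        return self.hook_names is not None

    def active_hooks(self) -> list[str]:
        return list(self.hook_names) if self.is_crosscoder else [self.hook_name]


def stack_layers(cfg, cache):
    """Gather one activation vector per hook, giving shape (n_layers, d_in).
    A standard SAE yields n_layers == 1, matching the existing layer dimension."""
    return [cache[name] for name in cfg.active_hooks()]


cache = {
    "blocks.0.hook_resid_pre": [0.1, 0.2],
    "blocks.1.hook_resid_pre": [0.3, 0.4],
}
sae_cfg = HookConfigSketch(hook_name="blocks.0.hook_resid_pre")
xc_cfg = HookConfigSketch(
    hook_name="blocks.0.hook_resid_pre",
    hook_names=["blocks.0.hook_resid_pre", "blocks.1.hook_resid_pre"],
)
print(stack_layers(sae_cfg, cache))  # [[0.1, 0.2]]
print(stack_layers(xc_cfg, cache))  # [[0.1, 0.2], [0.3, 0.4]]
```

Reusing the existing layer dimension means a standard SAE is just the n_layers == 1 case, so downstream code that already expects (batch, n_layers, d_in) does not need a separate path.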

The current inheritance setup in this draft isn't very clean and doesn't pass type checking. It might be easier to represent this after #425 or #468 is merged.

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Checklist:

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing tests pass locally with my changes
I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and tests

I have run make check-ci to check format and linting. (You can run make format to format code if needed.)

Add unit tests for the CrosscoderSAE's ability to collect activations from the same hook type at multiple layers. These tests verify:
- Initialization with multiple layers
- Activation collection from all layers
- Proper buffer and batch handling
- Layer-specific normalization
- Backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
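The tests mention layer-specific normalization. A minimal sketch of one way to do it, assuming each layer gets its own scaling factor chosen so that the layer's average activation L2 norm becomes sqrt(d_in), in the spirit of SAELens's single-hook `expected_average_only_in` normalization applied per layer. The function names are hypothetical.

```python
import math


def per_layer_scaling_factors(samples):
    """samples: list of examples, each n_layers x d_in.
    Returns one factor per layer so that, after scaling, the mean L2 norm
    of that layer's activations equals sqrt(d_in)."""
    n_layers, d_in = len(samples[0]), len(samples[0][0])
    factors = []
    for l in range(n_layers):
        mean_norm = sum(math.sqrt(sum(v * v for v in s[l])) for s in samples) / len(samples)
        factors.append(math.sqrt(d_in) / mean_norm)
    return factors


def apply_scaling(example, factors):
    # Multiply every activation in layer l by that layer's factor.
    return [[v * f for v in layer] for layer, f in zip(example, factors)]


# Two examples, two layers, d_in = 2; layer 0 has much larger norms than layer 1.
samples = [
    [[3.0, 4.0], [0.6, 0.8]],
    [[6.0, 8.0], [0.3, 0.4]],
]
factors = per_layer_scaling_factors(samples)
scaled = [apply_scaling(s, factors) for s in samples]
```

A single global factor would leave the large-norm layer dominating the summed reconstruction loss; per-layer factors put every layer on a comparable scale.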


mkbehr and others added 30 commits May 5, 2025 21:40

Implement multilayer activations store except normalization

2b7a438

CrosscoderSAE implementation, some tests

d1c603b

included tests: - test_crosscoder_sae_init - test_crosscoder_sae_fold_w_dec_norm hook_z excluded from tests

more norm tests

f8bf44e

save-and-load support

ce63c0b

WIP name

51bfac7

TrainingCrosscoderSAE implementation, decoder norm scaling test

d118887

test_TrainingCrosscoderSAE_encode_returns_same_value_as_encode_with_hidden_pre

92ae2bd

test_sae_forward

91190e5

test_sae_forward_with_mse_loss_norm

a73e8a7

mark ghost grads unsupported

f7149f4

fix hook name in tests

229192f

can_add_noise_to_hidden_pre test

8d36da4

b_dec init note

5c694e0

fix from_dict

e38de51

CrosscoderSAETrainer implementation, one test

b2ddc70

two more CrosscoderSAETrainer tests

529109e

test log dict

3935dbc

test_train_sae_group_on_language_model__runs

588de21

fix TrainingCrosscoderSAEConfig.to_dict

2bc00bb

quick name fixes to satisfy wandb

7d54ea4

use crosscoder from training runner

be7f780

initialize W_dec in TrainingCrosscoderSAE

0e6acdc

temporarily hardcode evals off

9cf93ea

training script

c1cfde5

add ActivationsStore.hook_names()

b17b607

l2/sparsity/variance evals for crosscoders

5fb5b49

tiny-stories-1m experiments

05512da

tiny-stories-28m

3d1abbd

minor fixes

9ea92c8

mkbehr added 30 commits May 5, 2025 21:40

scale W_dec init norm

c098be0

scale activations by layer

7c01f2d

some training changes

7957fa9

clean up some TODOs

093bc39

trim CrosscoderSAETrainer

09abaab

TODO notes in crosscoder trainer

e2deb2b

Change hook name syntax from {} to {layer}

f2ea460

fix evals_test

f8107b0

fix activations store test

41dbb5b

fix test_cache_activations_runner

b4e6c0d

fix test_crosscoder_sae

7c090eb

fix crosscoder sae trainer train step log dict

eff955d

Configure crosscoder decoder init norms

406202e

Config rework (most tests fail)

8b397d7

fix test_activations_store_multilayer

190d022

test_crosscoder_sae passes

c114946

training/test*crosscoder* passes

cf82b58

fix evals

9a29cba

fix evals again; all tests pass

7e2a40f

"global" acausal crosscoder script for gpt2-small

e035b8c

remove some TODOs

540e23f

remove more TODOs

109eba8

enable test_activations_store_normalization_multiple_layers

9989065

Update to new disk loader

c1ad393

test saving multilayer activation norm

07449ca

misc. cleanup

9d40b8c

fix format

c631c06

fix some type errors

bcfe7a5

revert changing wandb import line

0f7b349

train crosscoders without override_sae

750ee92

mkbehr wants to merge 61 commits into decoderesearch:main from mkbehr:crosscoder
