
feat: Add support for acausal crosscoders. #475

Draft
mkbehr wants to merge 61 commits into decoderesearch:main from mkbehr:crosscoder

@mkbehr mkbehr commented May 7, 2025

Description

Add support for acausal crosscoders, as described in https://transformer-circuits.pub/2024/crosscoders/index.html.

Add a new CrosscoderSAE class and related training classes. Crosscoder SAEs take activations from multiple hook points and predict activations from multiple hook points, for an input and output shape of (..., n_layers, d_in).
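To make the `(..., n_layers, d_in)` shape concrete, here is a minimal sketch of a crosscoder forward pass, following the formulation in the linked crosscoders write-up: every layer contributes to one shared latent vector, and each layer is then reconstructed from that vector. It uses NumPy rather than the PR's actual code, and the function name, parameter names, and weight shapes are assumptions for illustration, not the `CrosscoderSAE` API.

```python
import numpy as np


def crosscoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Illustrative crosscoder forward pass (not the PR's implementation).

    x:     (..., n_layers, d_in) activations stacked along a layer dimension
    W_enc: (n_layers, d_in, d_sae), b_enc: (d_sae,)
    W_dec: (d_sae, n_layers, d_in), b_dec: (n_layers, d_in)
    """
    # Encode: sum the per-layer projections into one shared latent space.
    pre_acts = np.einsum("...ld,lds->...s", x, W_enc) + b_enc
    feature_acts = np.maximum(pre_acts, 0.0)  # ReLU
    # Decode: each latent writes to every layer through its own decoder slice.
    recon = np.einsum("...s,sld->...ld", feature_acts, W_dec) + b_dec
    return feature_acts, recon


# Hypothetical sizes, chosen only to show the shapes.
rng = np.random.default_rng(0)
n_layers, d_in, d_sae = 3, 8, 16
x = rng.normal(size=(4, 5, n_layers, d_in))  # (batch, seq, n_layers, d_in)
W_enc = rng.normal(size=(n_layers, d_in, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, n_layers, d_in)) * 0.1
b_enc = np.zeros(d_sae)
b_dec = np.zeros((n_layers, d_in))

feature_acts, recon = crosscoder_forward(x, W_enc, b_enc, W_dec, b_dec)
```

The latent activations have no layer dimension, while the reconstruction has the same shape as the input.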

In this PR, the input and predicted activations must come from the same set of layers. This means the crosscoders have to be acausal, and they don't support use_error_term=False or any kind of reconstruction evals. In future changes, we can support causal crosscoders by letting the crosscoder take inputs from one set of hooks and predict another set of hooks. That would also cover transcoders as a special case.

The new config parameter hook_names controls whether to use a standard SAE or a crosscoder, and determines which hook points are used. When hook_names is set, ActivationsStore returns activations for multiple layers, stacked along the existing layer dimension.
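As a rough illustration of how a single parameter can switch between the two modes, the sketch below uses a hypothetical config class. The class name, the `is_crosscoder` and `active_hooks` helpers, the `stack_layers` function, and the choice of an empty list as the "not set" value are all assumptions made for this example. None of them are taken from the PR's code.

```python
from dataclasses import dataclass, field


@dataclass
class SketchRunnerConfig:
    """Hypothetical stand-in for the training config, not the PR's class."""

    hook_name: str = "blocks.0.hook_resid_pre"
    # Assumption: an empty list means "standard single-hook SAE".
    hook_names: list[str] = field(default_factory=list)

    @property
    def is_crosscoder(self) -> bool:
        return len(self.hook_names) > 0

    @property
    def active_hooks(self) -> list[str]:
        # A crosscoder reads every listed hook; a standard SAE reads one.
        return list(self.hook_names) if self.is_crosscoder else [self.hook_name]


def stack_layers(acts_by_hook, hooks):
    """Stack per-hook activations so that result[token][layer] is a d_in vector."""
    n_tokens = len(acts_by_hook[hooks[0]])
    return [[acts_by_hook[h][t] for h in hooks] for t in range(n_tokens)]


cfg = SketchRunnerConfig(
    hook_names=["blocks.0.hook_resid_pre", "blocks.1.hook_resid_pre"]
)
# Two tokens, d_in = 2, one list of activation vectors per hook point.
acts = {
    "blocks.0.hook_resid_pre": [[0.1, 0.2], [0.3, 0.4]],
    "blocks.1.hook_resid_pre": [[1.1, 1.2], [1.3, 1.4]],
}
stacked = stack_layers(acts, cfg.active_hooks)  # indexed [token][layer][d_in]
```

With `hook_names` left empty, `active_hooks` falls back to the single `hook_name`, so the stacked result has a layer dimension of size one.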

The current inheritance setup in this draft isn't very clean, and doesn't pass type checking. It might be easier to represent this after #425 or #468 is merged.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and tests

  • I have run make check-ci to check format and linting. (you can run make format to format code if needed.)

mkbehr and others added 30 commits May 5, 2025 21:40
Add unit tests for the CrossCoder SAE's ability to collect activations
from the same hook type at multiple layers. These tests verify:
- Initialization with multiple layers
- Activation collection from all layers
- Proper buffer and batch handling
- Layer-specific normalization
- Backward compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
included tests:
- test_crosscoder_sae_init
- test_crosscoder_sae_fold_w_dec_norm

hook_z excluded from tests