SAE
Overview
Sparse autoencoders (SAEs) provide an unsupervised approach for extracting interpretable features from a language model by reconstructing its activations from a sparse bottleneck layer.
\(k\)-Sparse Autoencoders Background
Training:
- Perform the feedforward phase and compute: \(\boldsymbol{z} = W^\top \boldsymbol{x} + \boldsymbol{b}\)
- Find the \(k\) largest activations of \(\boldsymbol{z}\) and set the rest to zero: \(\boldsymbol{z}_{(\Gamma)^c} = 0 \quad \text{where} \quad \Gamma = \operatorname{supp}_k(\boldsymbol{z})\)
- Compute the output and the error using the sparsified \(\boldsymbol{z}\): \(\hat{\boldsymbol{x}} = W \boldsymbol{z} + \boldsymbol{b}'\), \(E = \|\boldsymbol{x} - \hat{\boldsymbol{x}}\|_2^2\)
- Backpropagate the error through the \(k\) largest activations defined by \(\Gamma\) and iterate (see the sketch after this list).
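A minimal PyTorch sketch of one such training step, assuming tied encoder/decoder weights \(W\) as in Makhzani & Frey (2013); the tensor names, shapes, and learning rate are illustrative, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def ksae_train_step(x, W, b, b_dec, k, lr=1e-3):
    """One training step of a k-sparse autoencoder (sketch).

    x: (batch, d) inputs; W: (d, m) tied encoder/decoder weights;
    b: (m,) encoder bias; b_dec: (d,) decoder bias b'.
    """
    z = x @ W + b                                  # z = W^T x + b
    topk = torch.topk(z, k, dim=-1)                # support set Gamma = supp_k(z)
    z_sparse = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
    x_hat = z_sparse @ W.T + b_dec                 # x_hat = W z + b'
    loss = F.mse_loss(x_hat, x, reduction="sum")   # E = ||x - x_hat||_2^2
    loss.backward()                                # grads flow only through Gamma
    with torch.no_grad():
        for p in (W, b, b_dec):                    # plain SGD update
            p -= lr * p.grad
            p.grad = None
    return loss.item()

# Example: d=64 input dims, m=256 features, k=8 active units per example.
d, m, k = 64, 256, 8
W = (0.01 * torch.randn(d, m)).requires_grad_()
b = torch.zeros(m, requires_grad=True)
b_dec = torch.zeros(d, requires_grad=True)
loss = ksae_train_step(torch.randn(32, d), W, b, b_dec, k)
```

Because the hard top-\(k\) step zeroes the other units, only the \(k\) selected activations receive gradient, which is exactly the "backpropagate through \(\Gamma\)" rule above.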
Sparse Encoding:
Compute the features \(\boldsymbol{h} = W^\top \boldsymbol{x} + \boldsymbol{b}\). Find the \(\alpha k\) largest activations of \(\boldsymbol{h}\) and set the rest to zero: \(\boldsymbol{h}_{(\Gamma)^c} = 0 \quad \text{where} \quad \Gamma = \operatorname{supp}_{\alpha k}(\boldsymbol{h})\), where \(\alpha \ge 1\) is a small multiplier.
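The encoding rule, reusing the illustrative names from the training sketch above:

```python
def ksae_encode(x, W, b, k, alpha=2):
    """Inference-time sparse encoding: keep the alpha*k largest activations."""
    h = x @ W + b
    topk = torch.topk(h, alpha * k, dim=-1)
    return torch.zeros_like(h).scatter(-1, topk.indices, topk.values)
```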
This differs from a ReLU Autoencoder:
\[\begin{align} \boldsymbol{z} &= \mathrm{ReLU}(W_{enc}(\boldsymbol{x} - \boldsymbol{b}_{pre}) + \boldsymbol{b}_{enc}) \\ \hat{\boldsymbol{x}} &= W_{dec}\boldsymbol{z} + \boldsymbol{b}_{pre} \end{align}\]

Wang et al. (2025) use an SAE to force the target model to produce internal representations at the \(\ell\)-th layer that are "compatible" with a frozen SAE rich in reasoning features. Since the SAE was trained to reconstruct the reasoning-focused activations of the source model, this objective pushes the target model's activations to become explicitly similar to the source model's internal reasoning structure.
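A sketch of this ReLU SAE in PyTorch, with sparsity encouraged by an \(L_1\) penalty on \(\boldsymbol{z}\) (as in Cunningham et al., 2023) rather than a hard top-\(k\) support, plus a compatibility loss in the spirit of Wang et al. (2025); the class names, dimensions, and penalty weight are assumptions for illustration, not the papers' exact implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReluSAE(nn.Module):
    """ReLU sparse autoencoder matching the equations above (sketch)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_model, d_sae))
        self.W_dec = nn.Parameter(0.01 * torch.randn(d_sae, d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_pre = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        z = F.relu((x - self.b_pre) @ self.W_enc + self.b_enc)  # encode
        x_hat = z @ self.W_dec + self.b_pre                     # decode
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features;
    # l1_coeff is an illustrative hyperparameter.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().sum(-1).mean()

def sae_compat_loss(h_target, frozen_sae):
    """Hypothetical Resa-style objective: the target model's layer-l
    activations should be well reconstructed by the frozen, reasoning-trained
    SAE (frozen_sae's parameters would have requires_grad=False)."""
    x_hat, _ = frozen_sae(h_target)
    return F.mse_loss(x_hat, h_target)
```

Minimizing `sae_compat_loss` with respect to the target model (not the SAE) is what pulls the target's activations toward the source model's reasoning structure.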
References
- Alireza Makhzani, Brendan Frey. k-Sparse Autoencoders. 2013. [PDF].
- Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey. Sparse Autoencoders Find Highly Interpretable Features in Language Models. 2023. [PDF].
- Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and Evaluating Sparse Autoencoders. 2024. [PDF].
- Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger. Resa: Transparent Reasoning Models via SAEs. 2025. [PDF].
- Anthropic. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Anthropic. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.