Activation Intervention Overview

Many approaches to influencing the output of a pretrained LLM have been proposed:

  • Intervening on weights, as with supervised finetuning, RLHF, steerable layers, and weight editing
  • Intervening at decoding, as with guided or trainable decoding
  • Intervening on the prompt, as with automated prompt engineering
  • Intervening on token embeddings, as with ‘soft prompting’
  • Intervening on activations, for instance by freezing the weights of the LLM and searching for a ‘steering vector’ of activations.

In this post, we will focus on the last approach, activation intervention. The premise is that the value a concept takes on can be changed, without changing other concepts, by adding a suitable steering vector to the model's activations (Park et al., 2024).

ActAdd (Turner et al., 2023) manipulates the residual stream as follows:

  • Take a pair of natural-language prompts \((p_+, p_-)\), where \(p_+\) represents the property we wish the output text to emphasise (e.g. love) and \(p_-\) represents its opposite (e.g. hate or indifference).
  • Let \(h_+^l\) and \(h_-^l\) be the activation vectors for the prompts \(p_+\) and \(p_-\) at layer \(l\).
  • The difference \(h_+^l - h_-^l\) is a new activation vector which (intuitively) captures the difference between a prompt with the target property and one without it.
  • This difference is the steering vector: it is computed once, before inference, and then added (scaled by a coefficient) to the residual stream at layer \(l\) during generation; see the sketch after this list.
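To make this concrete, below is a minimal sketch in PyTorch with Hugging Face transformers. It simplifies ActAdd: the paper takes per-position activation differences over a padded prompt pair, while this sketch uses a single last-token difference and adds it at every position. The model name, layer index, coefficient, and prompts are illustrative choices, not the paper's settings.

```python
# Minimal ActAdd-style sketch. Assumptions: GPT-2 via Hugging Face
# `transformers`; LAYER, COEFF, and all prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, COEFF = 6, 4.0  # injection layer and steering strength (illustrative)

def resid_after_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]  # hidden_states[0] is the embedding output

# Steering vector: computed once, before inference.
steer = resid_after_layer("Love") - resid_after_layer("Hate")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    return (output[0] + COEFF * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("I think dogs are", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unsteered model
```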

How can we control reasoning models?

Reasoning models exhibit many behaviors, such as:

  • backtracking: The model abandons its current approach and explores an alternative strategy. In other words, after progressing down a reasoning path or proposing a candidate answer, the model switches to a different line of attack.
  • expressing uncertainty: The model explicitly states its confidence or uncertainty regarding its reasoning.
  • generating examples for hypothesis validation: The model constructs concrete examples to test a candidate hypothesis.

Ward et al. (2025) study backtracking through steering vectors computed from the residual-stream activations of a reasoning model at layer 10.

Difference of Means

The Difference of Means method is a widely used technique for extracting steering vectors in LLMs (Turner et al., 2023; Marks). It is based on constructing contrastive datasets that differ in a specific concept and computing the difference between the model's mean activations on each. Formally, let \(D_+\) and \(D_-\) be two datasets such that samples in \(D_+\) exhibit a given concept, while samples in \(D_-\) do not.

Given a model component, one can compute the Difference of Means vector:

\[u = \frac{1}{\lvert D_{+} \rvert} \sum_{p_i \in D_{+}} a(p_i) - \frac{1}{\lvert D_{-} \rvert} \sum_{p_j \in D_{-}} a(p_j),\]

where \(a(p_i)\) and \(a(p_j)\) denote the activations of the chosen model component on prompts from the respective datasets. This vector \(u\) captures the primary direction in activation space that differentiates the two datasets with respect to the target concept. One can compute such a vector at every layer \(\ell\) and for each behavior category \(c\) (Venhoff et al., 2025):

\[u_\ell^{c} = \frac{1}{\lvert D_{+} \rvert} \sum_{p_i \in D_{+}} \bar{a}_\ell^{c}(p_i) - \frac{1}{\lvert D_{-} \rvert} \sum_{p_j \in D_{-}} \bar{a}_\ell^{c}(p_j), \qquad \text{with} \qquad \bar{a}_\ell^{c}(p_i) = \frac{1}{\lvert \mathrm{seq}_c(p_i) \rvert} \sum_{t \in \mathrm{seq}_c(p_i)} a_\ell(t),\]

where \(\mathrm{seq}_c(p)\) is the set of all token sequences within prompt \(p\) that are annotated with category \(c\), including the token position immediately preceding each annotated sequence.
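Below is a sketch of the basic computation, reusing `model` and `tok` (and the GPT-2 assumptions) from the ActAdd example above. The layer index and toy datasets are hypothetical, backtracking-flavored illustrations; the per-category variant of Venhoff et al. would average only over the annotated token spans \(\mathrm{seq}_c(p)\) rather than over all tokens, as this sketch does.

```python
# Minimal difference-of-means sketch. Assumptions: `model` and `tok` from the
# ActAdd sketch above; LAYER and the toy datasets are illustrative.
import torch

LAYER = 10  # residual-stream layer to read from (illustrative)

D_pos = ["Wait, that can't be right. Let me try another approach.",
         "Hmm, this contradicts the earlier step. I need to backtrack."]
D_neg = ["The answer follows directly from the first equation.",
         "Substituting the values gives the final result."]

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER over all tokens of `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hs[0].mean(dim=0)  # average over the sequence dimension

def dataset_mean(prompts: list[str]) -> torch.Tensor:
    return torch.stack([mean_resid(p) for p in prompts]).mean(dim=0)

u = dataset_mean(D_pos) - dataset_mean(D_neg)  # difference-of-means vector
```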

References

  1. Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda. Reasoning-Finetuning Repurposes Latent Representations in Base Models. ICML 2025 Workshop on Actionable Interpretability. [PDF].
  2. Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda. Understanding Reasoning in Thinking Language Models via Steering Vectors. Workshop on Reasoning and Planning for Large Language Models at ICLR 2025. [PDF].
  3. Kiho Park, Yo Joong Choe, Victor Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models. ICML 2024. [PDF].
  4. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid. Steering Language Models With Activation Engineering. arXiv, 2023. [PDF].