White-box safety (i.e. Interpretability)
This section isn't very conceptually clean. See the Open Problems paper or DeepMind's agenda for stronger framings, which are, however, not as useful for descriptive purposes.
Reverse engineering
33 papers. Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's internal algorithm.
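As a concrete illustration, here is a minimal activation-patching sketch, a standard tool for the causal-validation step. The toy PyTorch model, inputs, and choice of component are all stand-ins, not any particular method from the papers above.
```python
import torch
import torch.nn as nn

# Toy two-layer model standing in for a transformer component under study.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

# 1) Cache the component's activation on the clean input.
cache = {}
def save_hook(module, inp, out):
    cache["act"] = out.detach()
h = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_x)
h.remove()

# 2) Patch that activation into the corrupted run.
def patch_hook(module, inp, out):
    return cache["act"]
h = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupt_x)
h.remove()

corrupt_logits = model(corrupt_x)
# Effect size: how much patching this component moves the output
# back toward the clean behavior.
print((patched_logits - corrupt_logits).abs().sum().item())
```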
Extracting latent knowledge
9 papers. Identify and decode the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
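A minimal sketch in the spirit of contrast-consistent search (CCS): train a probe on contrast pairs so that the two phrasings of a statement get complementary probabilities. The activations here are random stand-ins; real work extracts them from a language model.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
# Stand-in activations for contrast pairs: each statement phrased as true (pos)
# and as false (neg).
pos_acts, neg_acts = torch.randn(256, d), torch.randn(256, d)

probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    p_pos, p_neg = probe(pos_acts), probe(neg_acts)
    # Consistency: the two phrasings should get complementary probabilities.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate p = 0.5 solution.
    confidence = torch.min(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad(); loss.backward(); opt.step()
```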
Lie and deception detectors
11 papers. Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in its definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
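A minimal sketch of the simplest white-box detector: a linear probe trained on activations collected under honest vs. deceptive conditions. The data, labels, and mean shift are synthetic stand-ins.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
# Stand-in residual-stream activations collected while the model makes
# honest vs. deliberately false statements.
honest = rng.normal(0.0, 1.0, size=(200, d))
deceptive = rng.normal(0.3, 1.0, size=(200, d))  # small mean shift as a toy signal

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

detector = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", detector.score(X, y))
```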
Model diffing
9 papers. Understand what happens when a model is finetuned, and what the "diff" between the finetuned model and the original consists of.
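A minimal sketch of the crudest diffing pass: compare per-parameter change between a base model and a finetuned copy to localize where the edits concentrate. Both models and the "finetuning" perturbation are toy stand-ins.
```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
finetuned = copy.deepcopy(base)

# Stand-in "finetuning": perturb only the last layer.
with torch.no_grad():
    finetuned[2].weight += 0.1 * torch.randn_like(finetuned[2].weight)

# Per-parameter relative change, to see where finetuning concentrated its edits.
for (name, p_base), (_, p_ft) in zip(base.named_parameters(),
                                     finetuned.named_parameters()):
    rel = (p_ft - p_base).norm() / (p_base.norm() + 1e-8)
    print(f"{name}: relative change {rel:.3f}")
```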
Sparse coding
44 papers. Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts.
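A minimal sparse-autoencoder sketch: an overcomplete ReLU encoder with an L1 sparsity penalty, trained to reconstruct activations. The dimensions, expansion factor, and data are arbitrary stand-ins.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty on its hidden features."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse, hopefully monosemantic features
        return self.dec(f), f

torch.manual_seed(0)
d_model, d_hidden = 64, 512              # expansion factor 8, an arbitrary choice
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_model)        # stand-in residual-stream activations

for _ in range(100):
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```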
Causal abstractions
3 papers. Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
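A minimal interchange-intervention sketch: a hand-built network that implements the high-level model S = a + b, output = S - c by construction, so patching the hidden unit aligned with S matches intervening on S in the causal model. Everything here is a toy construction for illustration.
```python
import torch
import torch.nn as nn

# High-level causal model: S = a + b, output = S - c.
def high_level(a, b, c, S=None):
    S = a + b if S is None else S
    return S - c

# Hand-built low-level network whose first hidden unit computes S = a + b.
net1 = nn.Linear(3, 2, bias=False)   # inputs (a, b, c) -> hidden (S, c)
net2 = nn.Linear(2, 1, bias=False)   # hidden -> output S - c
with torch.no_grad():
    net1.weight.copy_(torch.tensor([[1., 1., 0.], [0., 0., 1.]]))
    net2.weight.copy_(torch.tensor([[1., -1.]]))

base, source = torch.tensor([1., 2., 3.]), torch.tensor([4., 5., 6.])

# Interchange intervention: run on `base`, but overwrite the hidden unit
# aligned with S using its value from the `source` run.
h_base, h_source = net1(base), net1(source)
h_patched = h_base.clone()
h_patched[0] = h_source[0]
low_out = net2(h_patched)

# The high-level model predicts the same output when S is set to a_src + b_src.
high_out = high_level(*base, S=source[0] + source[1])
print(low_out.item(), high_out.item())   # should agree if the mapping is right
```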
Data attribution
12 papers. Quantify the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
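A minimal TracIn-style sketch: score each training point by the dot product of its loss gradient with the test point's loss gradient, at a single checkpoint for simplicity. The model, data, and single-checkpoint simplification are stand-ins.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

train_xs, train_ys = torch.randn(50, 10), torch.randn(50, 1)
test_x, test_y = torch.randn(1, 10), torch.randn(1, 1)

def grad_vector(x, y):
    # Flattened gradient of the loss on (x, y) with respect to all parameters.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Influence score: alignment between the test gradient and each training gradient.
g_test = grad_vector(test_x, test_y)
scores = torch.stack([g_test @ grad_vector(train_xs[i:i+1], train_ys[i:i+1])
                      for i in range(len(train_xs))])
print("most influential training indices:", scores.topk(5).indices.tolist())
```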
Pragmatic interpretability
3 papers. Directly tackle concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.
Other interpretability
19 papers. Interpretability work that does not fit neatly into the other categories.
Learning dynamics and developmental interpretability
14 papers. Build tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context learning phases.
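A minimal sketch of the detection side: locate a grokking-like phase transition as the largest jump in a smoothed validation-metric curve. The curve is synthetic and the change-point method is deliberately crude.
```python
import numpy as np

# Stand-in validation-accuracy curve with a grokking-like late jump.
rng = np.random.default_rng(0)
steps = np.arange(1000)
acc = 0.1 + 0.02 * rng.standard_normal(1000)
acc[700:] += 0.8                       # sudden generalization at step 700

# Crude change-point detection: largest increase in a moving-average of the curve.
window = 25
smoothed = np.convolve(acc, np.ones(window) / window, mode="valid")
jump = np.diff(smoothed)
print("estimated transition step:", steps[int(jump.argmax()) + window // 2])
```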
Representation structure and geometry
13 papers. What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?
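A minimal sketch of one geometric question: do the representations concentrate near a low-dimensional linear subspace? The representations here are synthetic, constructed to have such structure by design.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 500
# Stand-in representations that secretly live near a 3-D linear subspace.
basis = rng.standard_normal((3, d))
reps = rng.standard_normal((n, 3)) @ basis + 0.05 * rng.standard_normal((n, d))

# How much simple linear structure is there? Inspect the PCA spectrum.
centered = reps - reps.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
print("variance explained by top 3 directions:", explained[:3].sum())
```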
Human inductive biases
6 papers. Discover connections between deep learning systems and human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning that applies to both humans and AI systems.
Monitoring concepts
11 papers. Identify directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and use them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
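A minimal sketch of the common recipe: estimate a concept direction as a difference of means between activations with and without the concept, then use the projection onto it as a runtime monitor. The activations, the 2-sigma threshold, and the concept itself are stand-ins.
```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Stand-in activations with and without the concept of interest (e.g. "refusal");
# in practice these come from curated prompt sets.
with_concept = rng.normal(0.5, 1.0, size=(300, d))
without_concept = rng.normal(0.0, 1.0, size=(300, d))

# Difference-of-means direction for the concept.
direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)
direction /= np.linalg.norm(direction)

# Runtime monitor: flag activations whose projection exceeds a threshold
# calibrated on the "without" set.
proj = without_concept @ direction
threshold = proj.mean() + 2 * proj.std()
new_act = rng.normal(0.5, 1.0, size=d)
print("concept flagged:", bool(new_act @ direction > threshold))
```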
Activation engineering
15 papers. Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
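A minimal steering sketch: add a fixed direction to an intermediate activation via a forward hook. The toy model, the random steering vector, and the strength alpha are all stand-ins; real steering vectors are typically derived from contrasting prompts.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# A steering vector would normally come from contrasting prompts
# (e.g. a mean activation difference); here it is random for illustration.
steering_vector = torch.randn(64)
alpha = 4.0   # steering strength, a tunable knob

def steer_hook(module, inp, out):
    return out + alpha * steering_vector   # add the direction to the activations

handle = model[0].register_forward_hook(steer_hook)
x = torch.randn(1, 32)
steered = model(x)
handle.remove()
unsteered = model(x)
print("output shift:", (steered - unsteered).norm().item())
```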