White-box safety (i.e. Interpretability)
This section isn't very conceptually clean. See the Open Problems paper or DeepMind's agenda for stronger framings, which are, however, not as useful for descriptive purposes.
Reverse engineering
33 papers. Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's internal algorithm.
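As a concrete illustration, here is a minimal activation-patching sketch, a standard tool for the causal-validation step. The toy PyTorch model, inputs, and choice of component are all stand-ins, not any particular method from the papers above.
```python
import torch
import torch.nn as nn

# Toy two-layer model standing in for a transformer component under study.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

# 1) Cache the component's activation on the clean input.
cache = {}
def save_hook(module, inp, out):
    cache["act"] = out.detach()
h = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_x)
h.remove()

# 2) Patch that activation into the corrupted run.
def patch_hook(module, inp, out):
    return cache["act"]
h = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupt_x)
h.remove()

corrupt_logits = model(corrupt_x)
# Effect size: how much patching this component moves the output
# back toward the clean behavior.
print((patched_logits - corrupt_logits).abs().sum().item())
```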
Extracting latent knowledge
9 papers. Identify and decode the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
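A minimal sketch in the spirit of contrast-consistent search (CCS): train a probe on contrast pairs so that the two phrasings of a statement get complementary probabilities. The activations here are random stand-ins; real work extracts them from a language model.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
# Stand-in activations for contrast pairs: each statement phrased as true (pos)
# and as false (neg).
pos_acts, neg_acts = torch.randn(256, d), torch.randn(256, d)

probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    p_pos, p_neg = probe(pos_acts), probe(neg_acts)
    # Consistency: the two phrasings should get complementary probabilities.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate p = 0.5 solution.
    confidence = torch.min(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad(); loss.backward(); opt.step()
```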
Lie and deception detectors
11 papers. Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in its definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
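A minimal sketch of the simplest white-box detector: a linear probe trained on activations collected under honest vs. deceptive conditions. The data, labels, and mean shift are synthetic stand-ins.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
# Stand-in residual-stream activations collected while the model makes
# honest vs. deliberately false statements.
honest = rng.normal(0.0, 1.0, size=(200, d))
deceptive = rng.normal(0.3, 1.0, size=(200, d))  # small mean shift as a toy signal

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

detector = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", detector.score(X, y))
```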
Model diffing
9 papers. Understand what happens when a model is finetuned, and what the "diff" between the finetuned model and the original consists of.
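A minimal sketch of the crudest diffing pass: compare per-parameter change between a base model and a finetuned copy to localize where the edits concentrate. Both models and the "finetuning" perturbation are toy stand-ins.
```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
finetuned = copy.deepcopy(base)

# Stand-in "finetuning": perturb only the last layer.
with torch.no_grad():
    finetuned[2].weight += 0.1 * torch.randn_like(finetuned[2].weight)

# Per-parameter relative change, to see where finetuning concentrated its edits.
for (name, p_base), (_, p_ft) in zip(base.named_parameters(),
                                     finetuned.named_parameters()):
    rel = (p_ft - p_base).norm() / (p_base.norm() + 1e-8)
    print(f"{name}: relative change {rel:.3f}")
```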
Sparse coding
44 papers. Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts.
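A minimal sparse-autoencoder sketch: an overcomplete ReLU encoder with an L1 sparsity penalty, trained to reconstruct activations. The dimensions, expansion factor, and data are arbitrary stand-ins.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty on its hidden features."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse, hopefully monosemantic features
        return self.dec(f), f

torch.manual_seed(0)
d_model, d_hidden = 64, 512              # expansion factor 8, an arbitrary choice
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_model)        # stand-in residual-stream activations

for _ in range(100):
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```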
Causal abstractions
3 papers. Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
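A minimal interchange-intervention sketch: a hand-built network that implements the high-level model S = a + b, output = S - c by construction, so patching the hidden unit aligned with S matches intervening on S in the causal model. Everything here is a toy construction for illustration.
```python
import torch
import torch.nn as nn

# High-level causal model: S = a + b, output = S - c.
def high_level(a, b, c, S=None):
    S = a + b if S is None else S
    return S - c

# Hand-built low-level network whose first hidden unit computes S = a + b.
net1 = nn.Linear(3, 2, bias=False)   # inputs (a, b, c) -> hidden (S, c)
net2 = nn.Linear(2, 1, bias=False)   # hidden -> output S - c
with torch.no_grad():
    net1.weight.copy_(torch.tensor([[1., 1., 0.], [0., 0., 1.]]))
    net2.weight.copy_(torch.tensor([[1., -1.]]))

base, source = torch.tensor([1., 2., 3.]), torch.tensor([4., 5., 6.])

# Interchange intervention: run on `base`, but overwrite the hidden unit
# aligned with S using its value from the `source` run.
h_base, h_source = net1(base), net1(source)
h_patched = h_base.clone()
h_patched[0] = h_source[0]
low_out = net2(h_patched)

# The high-level model predicts the same output when S is set to a_src + b_src.
high_out = high_level(*base, S=source[0] + source[1])
print(low_out.item(), high_out.item())   # should agree if the mapping is right
```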
Data attribution
12 papers. Quantify the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
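A minimal TracIn-style sketch: score each training point by the dot product of its loss gradient with the test point's loss gradient, at a single checkpoint for simplicity. The model, data, and single-checkpoint simplification are stand-ins.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

train_xs, train_ys = torch.randn(50, 10), torch.randn(50, 1)
test_x, test_y = torch.randn(1, 10), torch.randn(1, 1)

def grad_vector(x, y):
    # Flattened gradient of the loss on (x, y) with respect to all parameters.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Influence score: alignment between the test gradient and each training gradient.
g_test = grad_vector(test_x, test_y)
scores = torch.stack([g_test @ grad_vector(train_xs[i:i+1], train_ys[i:i+1])
                      for i in range(len(train_xs))])
print("most influential training indices:", scores.topk(5).indices.tolist())
```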
Pragmatic interpretability
3 papers. Directly tackle concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.
Other interpretability
19 papers. Interpretability work that does not fit neatly into the other categories.
Learning dynamics and developmental interpretability
14 papers. Build tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context learning phases.
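A minimal sketch of the detection side: locate a grokking-like phase transition as the largest jump in a smoothed validation-metric curve. The curve is synthetic and the change-point method is deliberately crude.
```python
import numpy as np

# Stand-in validation-accuracy curve with a grokking-like late jump.
rng = np.random.default_rng(0)
steps = np.arange(1000)
acc = 0.1 + 0.02 * rng.standard_normal(1000)
acc[700:] += 0.8                       # sudden generalization at step 700

# Crude change-point detection: largest increase in a moving-average of the curve.
window = 25
smoothed = np.convolve(acc, np.ones(window) / window, mode="valid")
jump = np.diff(smoothed)
print("estimated transition step:", steps[int(jump.argmax()) + window // 2])
```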
Representation structure and geometry
13 papers. What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?
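A minimal sketch of one geometric question: do the representations concentrate near a low-dimensional linear subspace? The representations here are synthetic, constructed to have such structure by design.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 500
# Stand-in representations that secretly live near a 3-D linear subspace.
basis = rng.standard_normal((3, d))
reps = rng.standard_normal((n, 3)) @ basis + 0.05 * rng.standard_normal((n, d))

# How much simple linear structure is there? Inspect the PCA spectrum.
centered = reps - reps.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
print("variance explained by top 3 directions:", explained[:3].sum())
```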
Human inductive biases
6 papers. Discover connections between deep learning systems and human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning that applies to both humans and AI systems.
Monitoring concepts
11 papers. Identify directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and use them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
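A minimal sketch of the common recipe: estimate a concept direction as a difference of means between activations with and without the concept, then use the projection onto it as a runtime monitor. The activations, the 2-sigma threshold, and the concept itself are stand-ins.
```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Stand-in activations with and without the concept of interest (e.g. "refusal");
# in practice these come from curated prompt sets.
with_concept = rng.normal(0.5, 1.0, size=(300, d))
without_concept = rng.normal(0.0, 1.0, size=(300, d))

# Difference-of-means direction for the concept.
direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)
direction /= np.linalg.norm(direction)

# Runtime monitor: flag activations whose projection exceeds a threshold
# calibrated on the "without" set.
proj = without_concept @ direction
threshold = proj.mean() + 2 * proj.std()
new_act = rng.normal(0.5, 1.0, size=d)
print("concept flagged:", bool(new_act @ direction > threshold))
```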
Activation engineering
15 papers. Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
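A minimal steering sketch: add a fixed direction to an intermediate activation via a forward hook. The toy model, the random steering vector, and the strength alpha are all stand-ins; real steering vectors are typically derived from contrasting prompts.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# A steering vector would normally come from contrasting prompts
# (e.g. a mean activation difference); here it is random for illustration.
steering_vector = torch.randn(64)
alpha = 4.0   # steering strength, a tunable knob

def steer_hook(module, inp, out):
    return out + alpha * steering_vector   # add the direction to the activations

handle = model[0].register_forward_hook(steer_hook)
x = torch.randn(1, 32)
steered = model(x)
handle.remove()
unsteered = model(x)
print("output shift:", (steered - unsteered).norm().item())
```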