Shallow Review of Technical AI Safety, 2025

Causal Abstractions

Verify that a neural network implements a specific high-level causal model (such as a logical algorithm) by finding a mapping between high-level variables and low-level neural representations; a minimal code sketch of this check appears at the end of this entry.
Theory of Change: By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model is using safe reasoning processes and predict its behavior on unseen inputs, rather than relying on behavioral testing alone.
General Approach: Cognitive
Target Case: Worst Case
Some names: Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang
Estimated FTEs: 10-30
Outputs:
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks (Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger)
Combining Causal Models for More Accurate Abstractions of Neural Networks (Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger)
How Causal Abstraction Underpins Computational Explanation (Atticus Geiger, Jacqueline Harding, Thomas Icard)
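
To make the core check concrete, here is a minimal sketch of an interchange intervention test on a toy network. It is illustrative only, not code from the work above: the TinyMLP model, the AND task, and the choice of hidden units are all assumptions made for the example. In practice the alignment between high-level variables and low-level subspaces is searched for or learned (e.g. by Distributed Alignment Search) rather than fixed by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMLP(nn.Module):
    """Toy low-level model: two input bits, one output probability."""
    def __init__(self, width=8):
        super().__init__()
        self.layer1 = nn.Linear(2, width)
        self.layer2 = nn.Linear(width, 1)

    def hidden(self, x):
        return torch.relu(self.layer1(x))

    def forward(self, x, patch=None, dims=None):
        h = self.hidden(x)
        if patch is not None:
            # Interchange intervention: overwrite the units hypothesized
            # to encode the high-level variable V with activations taken
            # from a different ("source") input.
            h = h.clone()
            h[:, dims] = patch[:, dims]
        return torch.sigmoid(self.layer2(h))

# Train the toy model on AND(a, b) so the check below is meaningful.
xs = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
ys = torch.tensor([[0.], [0.], [0.], [1.]])
model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy(model(xs), ys)
    loss.backward()
    opt.step()

# High-level causal model: a single variable V = AND(a, b), copied to the
# output. Hypothesis: units `dims` of the hidden layer realize V.
dims = [0, 1, 2, 3]  # illustrative guess; DAS would learn this alignment

def interchange_accuracy(dims):
    """Fraction of (base, source) pairs where patching V's low-level
    representation makes the network agree with the high-level
    counterfactual: the output should become AND evaluated on the source."""
    hits = 0
    with torch.no_grad():
        for base in xs:
            for src in xs:
                out = model(base[None], patch=model.hidden(src[None]), dims=dims)
                predicted = 1.0 if (src[0] > 0.5 and src[1] > 0.5) else 0.0
                hits += int(round(out.item()) == predicted)
    return hits / (len(xs) ** 2)

# Accuracy near 1.0 supports the claimed mapping; low values falsify it.
print(f"interchange intervention accuracy: {interchange_accuracy(dims):.2f}")
```

The papers listed above scale this idea up: intervention sites become learned subspaces of real model activations (optimized in DAS, or predicted by a hypernetwork in HyperDAS), and accuracy under interchange interventions is the evidence that the network realizes the high-level algorithm.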