Causal Abstractions
Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
Theory of Change: By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model is using safe reasoning processes and predict its behaviour on unseen inputs, rather than relying on behavioural testing alone (the core interchange-intervention test is sketched below the outputs).
General Approach: Cognitive
Target Case: Worst Case
Orthodox Problems:
Some names: Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang
Estimated FTEs: 10-30
Outputs:
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks — Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
Combining Causal Models for More Accurate Abstractions of Neural Networks — Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger
How Causal Abstraction Underpins Computational Explanation — Atticus Geiger, Jacqueline Harding, Thomas Icard
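To make the core check concrete, here is a minimal Python sketch of an interchange-intervention test: a hidden activation is hypothesized to realize a high-level variable, and the hypothesis holds if patching that activation with its value from a source input always changes the output exactly as the corresponding high-level intervention would. The toy model, the hidden variable `h`, and all function names are illustrative assumptions, not drawn from the papers above.

```python
# Minimal sketch of an interchange-intervention test for causal abstraction,
# assuming a toy "network" whose hidden unit h is hypothesized to encode the
# high-level variable V = (a AND b). Everything here is illustrative.
import itertools

# High-level causal model: V = a AND b; output = V OR c.
def high_level(a, b, c, v_override=None):
    v = (a and b) if v_override is None else v_override
    return v or c

# Low-level model: hidden activation h, with an optional patch (intervention).
def low_level(a, b, c, h_override=None):
    h = float(a and b)                  # hidden representation under test
    h = h if h_override is None else h_override
    return bool(h) or c

# Interchange intervention: read h on a *source* run, patch it into a *base*
# run, and require the output to match the high-level model under the
# corresponding intervention on V.
def interchange_ok(base, source):
    h_src = float(source[0] and source[1])        # h's value on the source run
    low_out = low_level(*base, h_override=h_src)  # patched base run
    v_src = source[0] and source[1]               # matching high-level value
    high_out = high_level(*base, v_override=v_src)
    return low_out == high_out

inputs = list(itertools.product([False, True], repeat=3))
assert all(interchange_ok(b, s) for b in inputs for s in inputs)
print("h faithfully realizes V under all interchange interventions")
```

In the work above, such alignments are searched for in real networks (e.g. via Distributed Alignment Search, as in HyperDAS) rather than hand-coded; the toy only makes the acceptance criterion explicit.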