Heuristic explanations
Formalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations", and use them to predict when novel inputs will lead to extreme behavior (i.e. "Low Probability Estimation" and "Mechanistic Anomaly Detection"). A toy sketch of the anomaly-detection idea follows below.
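As a loose illustration of the mechanistic anomaly detection idea (not ARC's actual method), the sketch below stands in for a heuristic explanation with a low-rank subspace fitted to activations from a trusted distribution, then flags inputs whose activations the "explanation" fails to reconstruct. The PCA-based score, dimensions, and the `anomaly_score` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption: trusted inputs produce activations near a k-dimensional
# subspace of R^d; this plays the role of behavior that a heuristic
# explanation accounts for.
d, k, n_trusted = 64, 4, 2000
basis = rng.normal(size=(d, k))
trusted = rng.normal(size=(n_trusted, k)) @ basis.T \
    + 0.05 * rng.normal(size=(n_trusted, d))

# Stand-in "explanation": the top-k principal subspace of the trusted
# activations.
mean = trusted.mean(axis=0)
_, _, vt = np.linalg.svd(trusted - mean, full_matrices=False)
subspace = vt[:k]  # (k, d); rows span the explained directions

def anomaly_score(acts):
    """Residual norm after projecting onto the explained subspace."""
    centered = acts - mean
    explained = centered @ subspace.T @ subspace
    return np.linalg.norm(centered - explained, axis=-1)

# Activations produced by a different mechanism score much higher,
# flagging the input as anomalous before any bad output is observed.
novel = rng.normal(size=(100, d))
print("trusted:", anomaly_score(trusted[:100]).mean())
print("novel:  ", anomaly_score(novel).mean())
```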
Theory of Change: The current goalpost is methods whose reasoned predictions about properties of a neural network's output distribution (for a given input distribution) are certifiably at least as accurate as estimates obtained by sampling. If this succeeds for safety-relevant properties, it should allow for automated alignment methods that are both human-legible and worst-case certified, as well as more efficient than sampling-based methods in most cases.
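To see why competing with sampling is a meaningful bar, consider a toy low probability estimation problem (again illustrative, not ARC's estimator): for an event of probability p, Monte Carlo needs on the order of 1/p samples before it sees a single hit, whereas a model whose output distribution is understood can estimate the tail directly. The linear "network", Gaussian inputs, and the 5-sigma threshold below are all assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy "network": one fixed linear layer applied to standard Gaussian inputs.
d = 32
w = rng.normal(size=d)
sigma = np.linalg.norm(w)   # the output is exactly N(0, sigma^2)
threshold = 5.0 * sigma     # "extreme behavior": a 5-sigma output

# Sampling baseline: the true probability is ~3e-7, so 10^5 samples
# almost always contain zero hits and the Monte Carlo estimate is 0.
n = 100_000
outputs = rng.normal(size=(n, d)) @ w
p_sampled = np.mean(outputs > threshold)

# "Reasoned" estimate: in this toy model the output distribution is
# known in closed form, so the tail probability needs no samples at all.
p_analytic = norm.sf(5.0)

print(f"sampling ({n} draws): {p_sampled:.1e}")
print(f"analytic estimate:    {p_analytic:.1e}")
```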
Target Case: Worst Case
See Also:
ARC Theory, ELK, mechanistic anomaly detection, Acorn, Guaranteed-Safe AI
Some names: Mark Xu, Eric Neyman, Victor Lecomte, George Robinson
Estimated FTEs: 1-10
Critiques:
Outputs:
A computational no-coincidence principle — Eric Neyman
ARC progress update: Competing with sampling — Eric Neyman
Obstacles in ARC's agenda — David Matolcsi
Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture — John Dunbar, Scott Aaronson