Heuristic explanations
Formalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations", and use them to predict when novel inputs will lead to extreme behavior (i.e. "Low Probability Estimation" and "Mechanistic Anomaly Detection"). A toy sketch of the anomaly-detection idea follows below.
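As a loose illustration of the mechanistic anomaly detection idea (not ARC's actual method), the sketch below stands in for a heuristic explanation with a low-rank subspace fitted to activations from a trusted distribution, then flags inputs whose activations the "explanation" fails to reconstruct. The PCA-based score, dimensions, and the `anomaly_score` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption: trusted inputs produce activations near a k-dimensional
# subspace of R^d; this plays the role of behavior that a heuristic
# explanation accounts for.
d, k, n_trusted = 64, 4, 2000
basis = rng.normal(size=(d, k))
trusted = rng.normal(size=(n_trusted, k)) @ basis.T \
    + 0.05 * rng.normal(size=(n_trusted, d))

# Stand-in "explanation": the top-k principal subspace of the trusted
# activations.
mean = trusted.mean(axis=0)
_, _, vt = np.linalg.svd(trusted - mean, full_matrices=False)
subspace = vt[:k]  # (k, d); rows span the explained directions

def anomaly_score(acts):
    """Residual norm after projecting onto the explained subspace."""
    centered = acts - mean
    explained = centered @ subspace.T @ subspace
    return np.linalg.norm(centered - explained, axis=-1)

# Activations produced by a different mechanism score much higher,
# flagging the input as anomalous before any bad output is observed.
novel = rng.normal(size=(100, d))
print("trusted:", anomaly_score(trusted[:100]).mean())
print("novel:  ", anomaly_score(novel).mean())
```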
Theory of Change: The current goalpost is methods whose reasoned predictions about properties of a neural network's output distribution (for a given input distribution) are certifiably at least as accurate as estimates obtained by sampling. If this succeeds for safety-relevant properties, it should allow for automated alignment methods that are both human-legible and worst-case certified, as well as more efficient than sampling-based methods in most cases.
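To see why competing with sampling is a meaningful bar, consider a toy low probability estimation problem (again illustrative, not ARC's estimator): for an event of probability p, Monte Carlo needs on the order of 1/p samples before it sees a single hit, whereas a model whose output distribution is understood can estimate the tail directly. The linear "network", Gaussian inputs, and the 5-sigma threshold below are all assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy "network": one fixed linear layer applied to standard Gaussian inputs.
d = 32
w = rng.normal(size=d)
sigma = np.linalg.norm(w)   # the output is exactly N(0, sigma^2)
threshold = 5.0 * sigma     # "extreme behavior": a 5-sigma output

# Sampling baseline: the true probability is ~3e-7, so 10^5 samples
# almost always contain zero hits and the Monte Carlo estimate is 0.
n = 100_000
outputs = rng.normal(size=(n, d)) @ w
p_sampled = np.mean(outputs > threshold)

# "Reasoned" estimate: in this toy model the output distribution is
# known in closed form, so the tail probability needs no samples at all.
p_analytic = norm.sf(5.0)

print(f"sampling ({n} draws): {p_sampled:.1e}")
print(f"analytic estimate:    {p_analytic:.1e}")
```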
Target Case: Worst Case
See Also:
ARC Theory, ELK, mechanistic anomaly detection, Acorn, Guaranteed-Safe AI
Some names: Mark Xu, Eric Neyman, Victor Lecomte, George Robinson
Estimated FTEs: 1-10
Critiques:
Outputs:
A computational no-coincidence principle — Eric Neyman
ARC progress update: Competing with sampling — Eric Neyman
Obstacles in ARC's agenda — David Matolcsi
Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture — John Dunbar, Scott Aaronson