Shallow Review of Technical AI Safety, 2025

Behavior alignment theory

Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and about how we might try to change them (a toy sketch of this kind of model appears after this entry).
Theory of Change: Figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold → test these hypotheses where feasible → design training environments that lead to more salutary properties.
Target Case: Worst Case
Some names: Ram Potham, Michael K. Cohen, John Wentworth
Estimated FTEs: 1-10
Critiques:
Ryan Greenblatt's criticism of one behavioural proposal
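To make "formal models of power-seeking" slightly more concrete, here is a minimal toy sketch (our own illustration, not taken from any of the researchers named above): a Monte-Carlo estimate of how much optimal value each state of a small hand-made MDP offers under randomly drawn reward functions, a crude analogue of the "power" quantities studied in this area. The MDP, discount factor, and sample counts are all invented for the example.

```python
"""Toy illustration only (not from this agenda's own publications): a crude
Monte-Carlo analogue of the 'power' quantities studied in formal models of
power-seeking. It estimates, for each state of a tiny deterministic MDP, the
average optimal value that state offers when the reward function is drawn at
random."""

import numpy as np

# Hypothetical 4-state deterministic MDP: SUCCESSORS[s] lists the states
# reachable from s. State 0 is a "hub" with many options; state 3 is a
# dead end that can only loop on itself.
SUCCESSORS = {0: [1, 2, 3], 1: [0, 2], 2: [0], 3: [3]}
GAMMA = 0.9


def optimal_values(reward: np.ndarray, iters: int = 200) -> np.ndarray:
    """Value iteration for a deterministic MDP with state-based rewards."""
    v = np.zeros(len(SUCCESSORS))
    for _ in range(iters):
        v = np.array(
            [reward[s] + GAMMA * max(v[t] for t in SUCCESSORS[s]) for s in SUCCESSORS]
        )
    return v


def average_optimal_value(samples: int = 2000, seed: int = 0) -> np.ndarray:
    """Average V*(s) over reward functions drawn i.i.d. Uniform[0, 1]."""
    rng = np.random.default_rng(seed)
    total = np.zeros(len(SUCCESSORS))
    for _ in range(samples):
        total += optimal_values(rng.uniform(size=len(SUCCESSORS)))
    return total / samples


if __name__ == "__main__":
    for s, val in enumerate(average_optimal_value()):
        print(f"state {s}: average optimal value ~ {val:.3f}")
    # Qualitatively, the dead-end state 3 should score lowest, and states
    # with more reachable options should tend to score higher.
```

The toy only shows the shape of the claims involved: "states that keep more options open tend to be instrumentally favoured" is the sort of statement this agenda would aim to state and prove for general classes of environments, rather than merely check empirically on small examples.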