Behavior alignment theory
Predict properties of future AGI (e.g. power-seeking) with formal models; state and prove hypotheses about the properties powerful systems will have and how we might try to change them. (Two illustrative formalisms are sketched below.)
Theory of Change: Formulate hypotheses about the properties powerful agents will have → attempt to rigorously prove under what conditions these hypotheses hold → test them where feasible → design training environments that lead to more salutary properties.
Target Case: Worst Case
Orthodox Problems:
See Also:
Some names: Ram Potham, Michael K. Cohen, John Wentworth
Estimated FTEs: 1-10
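As one example of the kind of formal statement involved: power-seeking has a candidate formalization due to Turner et al. ("Optimal Policies Tend to Seek Power"), who define the POWER of a state as a normalized average of optimal value over a distribution of reward functions and prove that, under certain environmental symmetries, optimal policies tend to steer toward high-POWER states. A minimal sketch of the definition (notation illustrative; check the paper for the precise statement):

```latex
% POWER of a state s (Turner et al., NeurIPS 2021), relative to a
% distribution D over reward functions R and a discount factor gamma:
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
    \;=\; \frac{1-\gamma}{\gamma}\;
      \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]
\]
% V*_R is the optimal value function for reward R. Intuitively, states
% that keep more options open (e.g. the agent remaining switched on) have
% higher optimal value for most draws of R, hence higher POWER; the
% power-seeking theorems make this precise under symmetry assumptions.
```

Proving when trained (rather than optimal) policies inherit this tendency is exactly the sort of hypothesis this agenda aims to state and test.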
Critiques:
Ryan Greenblatt's criticism of one behavioural proposal
Outputs:
Serious Flaws in CAST — Max Harms
A Shutdown Problem Proposal — johnswentworth, David Lorell
Shutdownable Agents through POST-Agency — Elliott Thornley
The Partially Observable Off-Switch Game — Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons
Imitation learning is probably existentially safe — Michael K. Cohen, Marcus Hutter
Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power — Jobst Heitzig, Ram Potham
Deceptive Alignment and Homuncularity — Oliver Sourbut, TurnTrout
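Several of the outputs above concern shutdownability, so a minimal sketch of the fully observable off-switch game (Hadfield-Menell et al.), which The Partially Observable Off-Switch Game generalizes, may help fix ideas. The setup and inequality follow the standard formulation; the notation here (U for the human's utility; a, w(a), s for acting, deferring, and shutting down) is illustrative:

```latex
% Off-switch game, simplified. A robot, uncertain about the human's
% utility function U, chooses between acting directly (a), shutting
% itself down (s), or proposing a and deferring to the human (w(a)),
% who may then press the off switch. Payoff: U(a) if a executes,
% 0 if the robot is switched off.
%
% If the human is rational (vetoes exactly when U(a) < 0), then
\[
  \mathbb{E}\!\left[\max\bigl(U(a),\, 0\bigr)\right]
    \;\ge\; \max\bigl(\mathbb{E}[U(a)],\, 0\bigr),
\]
% i.e. deferring weakly dominates both acting and shutting down, and
% strictly dominates whenever P(U(a) > 0) and P(U(a) < 0) are both
% positive. The robot's incentive to preserve its off switch thus
% hinges on its uncertainty about U and on the human's rationality.
```

The partially observable variant asks how much of this survives when the human and the robot observe different parts of the state.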