Behavior alignment theory
Predict properties of future AGI (e.g. power-seeking) with formal models; state and prove hypotheses about the properties powerful systems will have and how we might try to change them. (Two illustrative formalisms are sketched below.)
Theory of Change: Formulate hypotheses about the properties powerful agents will have → attempt to rigorously prove under what conditions these hypotheses hold → test them where feasible → design training environments that lead to more salutary properties.
Target Case: Worst Case
Orthodox Problems:
See Also:
Some names: Ram Potham, Michael K. Cohen, John Wentworth
Estimated FTEs: 1-10
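As one example of the kind of formal statement involved: power-seeking has a candidate formalization due to Turner et al. ("Optimal Policies Tend to Seek Power"), who define the POWER of a state as a normalized average of optimal value over a distribution of reward functions and prove that, under certain environmental symmetries, optimal policies tend to steer toward high-POWER states. A minimal sketch of the definition (notation illustrative; check the paper for the precise statement):

```latex
% POWER of a state s (Turner et al., NeurIPS 2021), relative to a
% distribution D over reward functions R and a discount factor gamma:
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
    \;=\; \frac{1-\gamma}{\gamma}\;
      \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]
\]
% V*_R is the optimal value function for reward R. Intuitively, states
% that keep more options open (e.g. the agent remaining switched on) have
% higher optimal value for most draws of R, hence higher POWER; the
% power-seeking theorems make this precise under symmetry assumptions.
```

Proving when trained (rather than optimal) policies inherit this tendency is exactly the sort of hypothesis this agenda aims to state and test.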
Critiques:
Ryan Greenblatt's criticism of one behavioural proposal
Outputs:
Serious Flaws in CAST — Max Harms
A Shutdown Problem Proposal — johnswentworth, David Lorell
Shutdownable Agents through POST-Agency — Elliott Thornley
The Partially Observable Off-Switch Game — Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons
Imitation learning is probably existentially safe — Michael K. Cohen, Marcus Hutter
Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power — Jobst Heitzig, Ram Potham
Deceptive Alignment and Homuncularity — Oliver Sourbut, TurnTrout
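Several of the outputs above concern shutdownability, so a minimal sketch of the fully observable off-switch game (Hadfield-Menell et al.), which The Partially Observable Off-Switch Game generalizes, may help fix ideas. The setup and inequality follow the standard formulation; the notation here (U for the human's utility; a, w(a), s for acting, deferring, and shutting down) is illustrative:

```latex
% Off-switch game, simplified. A robot, uncertain about the human's
% utility function U, chooses between acting directly (a), shutting
% itself down (s), or proposing a and deferring to the human (w(a)),
% who may then press the off switch. Payoff: U(a) if a executes,
% 0 if the robot is switched off.
%
% If the human is rational (vetoes exactly when U(a) < 0), then
\[
  \mathbb{E}\!\left[\max\bigl(U(a),\, 0\bigr)\right]
    \;\ge\; \max\bigl(\mathbb{E}[U(a)],\, 0\bigr),
\]
% i.e. deferring weakly dominates both acting and shutting down, and
% strictly dominates whenever P(U(a) > 0) and P(U(a) < 0) are both
% positive. The robot's incentive to preserve its off switch thus
% hinges on its uncertainty about U and on the human's rationality.
```

The partially observable variant asks how much of this survives when the human and the robot observe different parts of the state.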