Shallow Review of Technical AI Safety, 2025

Other corrigibility

Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors

Theory of Change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following does not buy robust corrigibility, and building strong tripwires and diagnostics for scheming and Goodharting, therefore reduces risk on the likely default path.

Target Case: Pessimistic

Orthodox Problems:

2. Corrigibility is anti-natural

5. Instrumental convergence

See Also:

Behavior alignment theory

Some names: Jeremy Gillen

Estimated FTEs: 1-10

Outputs:

AI Assistants Should Have a Direct Line to Their Developers— Jan_Kulveit

Detect Goodhart and shut down— Jeremy Gillen

Instrumental goals are a different and friendlier kind of [content unavailable]

Shutdownable Agents through POST-Agency— EJT

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")

Oblivious Defense in ML Models: Backdoor Removal without Detection— Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, Vinod Vaikuntanathan

Cryptographic Backdoor for Neural Networks: Boon and Bane— Anh Tu Ngo, Anupam Chattopadhyay, Subhamoy Maitra

A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning— Greg Gluch, Shafi Goldwasser

Problems with instruction-following as an alignment target— Seth Herd