Shallow Review of Technical AI Safety, 2025

Other corrigibility

Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors.
Theory of Change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following doesn't buy robust corrigibility, and building strong tripwires and diagnostics for scheming and Goodharting, should therefore reduce risk on the likely default path.
Target Case: Pessimistic
Some names: Jeremy Gillen
Estimated FTEs: 1-10