Shallow Review of Technical AI Safety, 2025

RL safety

Aims to improve the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of Change: Standard RL objectives (such as maximizing expected value under a learned reward) are brittle and can lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or inverse reward learning with provable guarantees), we can build agents that remain safe even when the reward or environment model is misspecified. (A toy sketch contrasting minimax regret with expected-value maximization follows the output list below.)
General Approach: Engineering
Target Case: Pessimistic
Some names: Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Estimated FTEs: 20-70
Outputs:
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse)
Safe Learning Under Irreversible Dynamics via Asking for Help (Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell)
Mitigating Goal Misgeneralization via Minimax Regret (Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis)
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? (Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun)
The Invisible Leash: Why RLVR May or May Not Escape Its Origin (Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi)
Interpreting Emergent Planning in Model-Free Reinforcement Learning (Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger)
Misalignment from Treating Means as Ends (Henrik Marklund, Alex Infanger, Benjamin Van Roy)
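To make the contrast in the theory of change concrete, here is a minimal toy sketch of how a minimax-regret criterion can pick a different policy than expected-value maximization when the agent's belief over reward functions may be misspecified. All policy names, reward hypotheses, and numbers below are hypothetical illustrations, not taken from any of the papers above.

```python
# Toy comparison (hypothetical numbers): expected-value maximization vs. minimax regret
# over a small set of plausible reward hypotheses.

# values[policy][hypothesis] = return of that policy if that reward hypothesis is true.
values = {
    "exploit_proxy": {"proxy_reward": 10.0, "true_reward": -5.0},
    "cautious":      {"proxy_reward":  6.0, "true_reward":  5.0},
}
# Assumed (possibly misspecified) belief over which reward hypothesis is correct.
prior = {"proxy_reward": 0.8, "true_reward": 0.2}

def expected_value(policy):
    # Standard objective: average return weighted by the belief over hypotheses.
    return sum(prior[r] * v for r, v in values[policy].items())

def max_regret(policy):
    # Regret under hypothesis r = best achievable return under r minus this policy's return;
    # minimax regret scores a policy by its worst case over hypotheses.
    best = {r: max(values[p][r] for p in values) for r in prior}
    return max(best[r] - values[policy][r] for r in prior)

ev_choice = max(values, key=expected_value)   # "exploit_proxy": EV 7.0 vs 5.8
mmr_choice = min(values, key=max_regret)      # "cautious": worst-case regret 4.0 vs 10.0
print(ev_choice, mmr_choice)
```

The expected-value maximizer backs the policy that exploits the proxy reward because the prior puts most of its mass on that hypothesis; the minimax-regret criterion prefers the cautious policy because its worst-case shortfall across hypotheses is smaller. This is the basic intuition behind using minimax regret to mitigate goal misgeneralization under a misspecified reward.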