RL safety
Aims to improve the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of Change: Standard RL objectives (such as maximizing expected return) are brittle and lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or inverse reward learning with provable guarantees), we can build agents that remain safe even when the reward or objective is misspecified (a toy numerical contrast between these criteria is sketched after the outputs list below).
General Approach: Engineering
Target Case: Pessimistic
Some names: Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Estimated FTEs: 20-70
Outputs:
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret — Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse
Safe Learning Under Irreversible Dynamics via Asking for Help — Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell
Mitigating Goal Misgeneralization via Minimax Regret — Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? — Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun
The Invisible Leash: Why RLVR May or May Not Escape Its Origin — Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference — Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Interpreting Emergent Planning in Model-Free Reinforcement Learning — Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger
Misalignment from Treating Means as Ends — Henrik Marklund, Alex Infanger, Benjamin Van Roy
Safety cases for Pessimism — Michael Cohen
We need a field of Reward Function Design — Steven Byrnes
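To make the theory-of-change contrast concrete, here is a minimal toy sketch, not drawn from any of the outputs above: a three-armed bandit with three hypotheses about the true reward function. The reward table, the posterior weights, and the arm labels are all illustrative assumptions. The point is only that maximizing expected value under a learned reward model can select an action that both a pessimistic criterion and a minimax-regret criterion would reject, because one plausible reward hypothesis makes that action catastrophic.

```python
# Toy sketch (illustrative assumptions throughout): expected-value maximization
# vs. pessimism vs. minimax regret under reward uncertainty.
import numpy as np

# rewards[h][a] = return of arm a if hypothesis h is the true reward function.
rewards = np.array([
    [2.0, 0.8, 0.5],   # hypothesis 0: arm 0 looks best
    [2.0, 0.8, 0.5],   # hypothesis 1: agrees with hypothesis 0
    [-5.0, 0.8, 0.5],  # hypothesis 2: arm 0 is catastrophic
])
posterior = np.array([0.45, 0.45, 0.10])  # learned reward model's belief

# 1) Maximize expected value under the learned reward model.
expected_value = posterior @ rewards            # [1.3, 0.8, 0.5]
arm_ev = int(np.argmax(expected_value))         # -> arm 0, despite the tail risk

# 2) Pessimism: maximize the worst-case return over plausible hypotheses.
worst_case = rewards.min(axis=0)                # [-5.0, 0.8, 0.5]
arm_pess = int(np.argmax(worst_case))           # -> arm 1

# 3) Minimax regret: minimize the worst-case shortfall relative to the
#    best arm under each hypothesis.
regret = rewards.max(axis=1, keepdims=True) - rewards
arm_mmr = int(np.argmin(regret.max(axis=0)))    # -> arm 1

print("expected-value arm:", arm_ev)
print("pessimistic arm:   ", arm_pess)
print("minimax-regret arm:", arm_mmr)
```

The outputs listed above study versions of this gap in far more general settings, e.g. regret bounds when optimizing learned reward functions and minimax-regret training against goal misgeneralization.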