RL safety
Aims to improve the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of Change: Standard RL objectives (such as maximizing expected return) are brittle and lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or inverse reward learning with provable guarantees), we can build agents that remain safe even when the reward or objective is misspecified (a toy numerical contrast between these criteria is sketched after the outputs list below).
General Approach: Engineering
Target Case: Pessimistic
Some names: Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Estimated FTEs: 20-70
Outputs:
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret — Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse
Safe Learning Under Irreversible Dynamics via Asking for Help — Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell
Mitigating Goal Misgeneralization via Minimax Regret — Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? — Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun
The Invisible Leash: Why RLVR May or May Not Escape Its Origin — Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference — Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Interpreting Emergent Planning in Model-Free Reinforcement Learning — Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger
Misalignment from Treating Means as Ends — Henrik Marklund, Alex Infanger, Benjamin Van Roy
Safety cases for Pessimism — Michael Cohen
We need a field of Reward Function Design — Steven Byrnes
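To make the theory-of-change contrast concrete, here is a minimal toy sketch, not drawn from any of the outputs above: a three-armed bandit with three hypotheses about the true reward function. The reward table, the posterior weights, and the arm labels are all illustrative assumptions. The point is only that maximizing expected value under a learned reward model can select an action that both a pessimistic criterion and a minimax-regret criterion would reject, because one plausible reward hypothesis makes that action catastrophic.

```python
# Toy sketch (illustrative assumptions throughout): expected-value maximization
# vs. pessimism vs. minimax regret under reward uncertainty.
import numpy as np

# rewards[h][a] = return of arm a if hypothesis h is the true reward function.
rewards = np.array([
    [2.0, 0.8, 0.5],   # hypothesis 0: arm 0 looks best
    [2.0, 0.8, 0.5],   # hypothesis 1: agrees with hypothesis 0
    [-5.0, 0.8, 0.5],  # hypothesis 2: arm 0 is catastrophic
])
posterior = np.array([0.45, 0.45, 0.10])  # learned reward model's belief

# 1) Maximize expected value under the learned reward model.
expected_value = posterior @ rewards            # [1.3, 0.8, 0.5]
arm_ev = int(np.argmax(expected_value))         # -> arm 0, despite the tail risk

# 2) Pessimism: maximize the worst-case return over plausible hypotheses.
worst_case = rewards.min(axis=0)                # [-5.0, 0.8, 0.5]
arm_pess = int(np.argmax(worst_case))           # -> arm 1

# 3) Minimax regret: minimize the worst-case shortfall relative to the
#    best arm under each hypothesis.
regret = rewards.max(axis=1, keepdims=True) - rewards
arm_mmr = int(np.argmin(regret.max(axis=0)))    # -> arm 1

print("expected-value arm:", arm_ev)
print("pessimistic arm:   ", arm_pess)
print("minimax-regret arm:", arm_mmr)
```

The outputs listed above study versions of this gap in far more general settings, e.g. regret bounds when optimizing learned reward functions and minimax-regret training against goal misgeneralization.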