Shallow Review of Technical AI Safety, 2025

Assistance games, assistive agents

Formalize how AI assistants can learn human preferences under uncertainty and partial observability, and construct environments that better incentivize AIs to learn what we want them to learn.
Theory of Change: Understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
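To make the setup concrete, here is a minimal sketch of the belief update at the heart of an assistance game: the human knows a reward parameter, the assistant does not, and the assistant infers it from the human's noisily rational actions. Everything here (the three-goal set, the Boltzmann human model, the toy reward) is an illustrative assumption, not the formulation from any specific paper below.

```python
import numpy as np

# Minimal assistance-game sketch: the human knows a reward parameter theta;
# the assistant starts uncertain and infers theta from the human's choices.
rng = np.random.default_rng(0)

thetas = np.array([0, 1, 2])          # candidate goals (which of 3 items the human wants)
belief = np.full(len(thetas), 1 / 3)  # assistant's prior over goals
beta = 2.0                            # Boltzmann rationality: higher = more rational human

def reward(theta, action):
    """Toy reward: the human gets 1 for picking their goal item, 0 otherwise."""
    return 1.0 if action == theta else 0.0

def human_policy(theta, actions):
    """Boltzmann-rational human: softmax over action rewards."""
    logits = beta * np.array([reward(theta, a) for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

actions = [0, 1, 2]
true_theta = 1  # known to the human, hidden from the assistant

for step in range(5):
    # Human acts; assistant observes the action and updates its belief by Bayes' rule.
    a_h = rng.choice(actions, p=human_policy(true_theta, actions))
    likelihood = np.array([human_policy(th, actions)[a_h] for th in thetas])
    belief = belief * likelihood
    belief /= belief.sum()
    print(f"step {step}: human chose {a_h}, belief over goals = {np.round(belief, 3)}")

# The assistant would then act to maximize expected reward under this belief,
# which incentivizes information-gathering (e.g., asking) while it is still uncertain.
```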
Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Stuart Russell
Critiques:
A nice summary of historical problem statements.
Outputs:
Training LLM Agents to Empower Humans (Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach)
AssistanceZero: Scalably Solving Assistance Games (Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan)
Observation Interference in Partially Observable Assistance Games (Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell)
Learning to Assist Humans without Inferring Rewards (Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan); the empowerment idea is sketched after this list.
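Two of the outputs above replace reward inference with an empowerment objective: the assistant acts so that the human's actions have maximal influence over future states, sidestepping goal inference entirely. Below is a hedged one-step sketch under strong simplifying assumptions (deterministic dynamics, so channel capacity reduces to counting distinct reachable outcomes); the corridor, the door, and all function names are illustrative, not taken from the papers.

```python
from math import log2

def empowerment(transition, state, human_actions):
    """One-step empowerment: log of the number of distinct outcomes the human
    can reach from `state` (channel capacity when dynamics are deterministic)."""
    outcomes = {transition(state, a) for a in human_actions}
    return log2(len(outcomes))

# Toy corridor of cells 0..4 with a "door" at cell 2 that blocks the human
# unless the assistant has opened it.
def make_transition(door_open):
    def transition(state, a):            # a in {-1, 0, +1}
        nxt = state + a
        if nxt == 2 and not door_open:   # closed door: the move fails
            return state
        return max(0, min(4, nxt))
    return transition

human_actions = [-1, 0, 1]
for door_open in (False, True):
    t = make_transition(door_open)
    e = empowerment(t, state=1, human_actions=human_actions)
    print(f"door_open={door_open}: human empowerment at cell 1 = {e:.2f} bits")

# An empowerment-maximizing assistant opens the door (raising the human's
# one-step empowerment from 1.00 to ~1.58 bits) without ever learning which
# cell the human actually wants to reach.
```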