Assistance games, assistive agents
Formalize how AI assistants learn about human preferences under uncertainty and partial observability, and construct environments that better incentivize AIs to learn what we want them to learn.
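As a concrete illustration, here is a minimal sketch of the core assistance-game loop in the style of cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016), not the algorithm of any particular paper below: the human knows a reward parameter theta, the assistant holds a belief over theta, updates it by watching noisily rational human choices, and then acts under its posterior rather than a point estimate. The two-candidate reward set, action features, and rationality parameter are all illustrative assumptions.

```python
# Toy assistance game sketch (CIRL-style; all numbers are illustrative).
import numpy as np

rng = np.random.default_rng(0)

THETAS = np.array([[1.0, 0.0], [0.0, 1.0]])            # candidate reward vectors
ACTIONS = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # action feature vectors

true_theta = THETAS[0]   # known to the human, hidden from the assistant
belief = np.ones(len(THETAS)) / len(THETAS)  # assistant's uniform prior over theta
beta = 3.0               # human rationality (Boltzmann temperature)

def human_policy(theta):
    """Noisily rational human: softmax over action rewards under theta."""
    rewards = ACTIONS @ theta
    probs = np.exp(beta * rewards)
    return probs / probs.sum()

for step in range(10):
    # Human acts; the assistant observes the action.
    a = rng.choice(len(ACTIONS), p=human_policy(true_theta))
    # Bayesian update: P(theta | a) is proportional to P(a | theta) * P(theta).
    likelihoods = np.array([human_policy(th)[a] for th in THETAS])
    belief = likelihoods * belief
    belief /= belief.sum()

# The assistant maximizes expected reward under its posterior over theta,
# rather than committing to a single point estimate.
expected_theta = belief @ THETAS
best = ACTIONS[np.argmax(ACTIONS @ expected_theta)]
print("posterior over theta:", belief)
print("assistant's chosen action features:", best)
```

The point of the sketch is the last step: because the assistant plans against its whole posterior, residual uncertainty about theta keeps it hedging toward actions that are acceptable under either hypothesis instead of overcommitting early.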
Theory of Change: Understand what can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
See Also:
Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Stuart Russell
Critiques:
A nice summary of historical problem statements
Outputs:
Training LLM Agents to Empower Humans — Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach
Murphy's Laws of AI Alignment: Why the Gap Always Wins — Madhava Gaikwad
AssistanceZero: Scalably Solving Assistance Games — Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
Observation Interference in Partially Observable Assistance Games — Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell
Learning to Assist Humans without Inferring Rewards — Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan