Shallow Review of Technical AI Safety, 2025

Assistance games, assistive agents

Formalize how AI assistants can learn human preferences under uncertainty and partial observability, and construct environments that better incentivize AIs to learn what we want them to learn.
Theory of Change: Understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
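To make the setup concrete, here is a minimal sketch of the belief update at the heart of an assistance game: the human knows a reward parameter, the assistant does not, and the assistant infers it from the human's noisily rational actions. Everything here (the three-goal set, the Boltzmann human model, the toy reward) is an illustrative assumption, not the formulation from any specific paper below.

```python
import numpy as np

# Minimal assistance-game sketch: the human knows a reward parameter theta;
# the assistant starts uncertain and infers theta from the human's choices.
rng = np.random.default_rng(0)

thetas = np.array([0, 1, 2])          # candidate goals (which of 3 items the human wants)
belief = np.full(len(thetas), 1 / 3)  # assistant's prior over goals
beta = 2.0                            # Boltzmann rationality: higher = more rational human

def reward(theta, action):
    """Toy reward: the human gets 1 for picking their goal item, 0 otherwise."""
    return 1.0 if action == theta else 0.0

def human_policy(theta, actions):
    """Boltzmann-rational human: softmax over action rewards."""
    logits = beta * np.array([reward(theta, a) for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

actions = [0, 1, 2]
true_theta = 1  # known to the human, hidden from the assistant

for step in range(5):
    # Human acts; assistant observes the action and updates its belief by Bayes' rule.
    a_h = rng.choice(actions, p=human_policy(true_theta, actions))
    likelihood = np.array([human_policy(th, actions)[a_h] for th in thetas])
    belief = belief * likelihood
    belief /= belief.sum()
    print(f"step {step}: human chose {a_h}, belief over goals = {np.round(belief, 3)}")

# The assistant would then act to maximize expected reward under this belief,
# which incentivizes information-gathering (e.g., asking) while it is still uncertain.
```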
Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Stuart Russell
Critiques:
A nice summary of historical problem statements.
Outputs:
Training LLM Agents to Empower Humans (Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach)
AssistanceZero: Scalably Solving Assistance Games (Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan)
Observation Interference in Partially Observable Assistance Games (Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell)
Learning to Assist Humans without Inferring Rewards (Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan); the empowerment idea is sketched after this list.
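Two of the outputs above replace reward inference with an empowerment objective: the assistant acts so that the human's actions have maximal influence over future states, sidestepping goal inference entirely. Below is a hedged one-step sketch under strong simplifying assumptions (deterministic dynamics, so channel capacity reduces to counting distinct reachable outcomes); the corridor, the door, and all function names are illustrative, not taken from the papers.

```python
from math import log2

def empowerment(transition, state, human_actions):
    """One-step empowerment: log of the number of distinct outcomes the human
    can reach from `state` (channel capacity when dynamics are deterministic)."""
    outcomes = {transition(state, a) for a in human_actions}
    return log2(len(outcomes))

# Toy corridor of cells 0..4 with a "door" at cell 2 that blocks the human
# unless the assistant has opened it.
def make_transition(door_open):
    def transition(state, a):            # a in {-1, 0, +1}
        nxt = state + a
        if nxt == 2 and not door_open:   # closed door: the move fails
            return state
        return max(0, min(4, nxt))
    return transition

human_actions = [-1, 0, 1]
for door_open in (False, True):
    t = make_transition(door_open)
    e = empowerment(t, state=1, human_actions=human_actions)
    print(f"door_open={door_open}: human empowerment at cell 1 = {e:.2f} bits")

# An empowerment-maximizing assistant opens the door (raising the human's
# one-step empowerment from 1.00 to ~1.58 bits) without ever learning which
# cell the human actually wants to reach.
```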