Shallow Review of Technical AI Safety, 2025

Aligning to context

Align AI directly to the role of participant, collaborator, or advisor within our best real human practices and institutions, rather than to separately representable goals, rules, or utility functions.
Theory of Change:"Many classical problems in AGI alignment are downstream of a type error about human values." Operationalizing a correct view of human values - one that treats human values as impossible or impractical to abstract from concrete practices - will unblock value fragility, goal-misgeneralization, instrumental convergence, and pivotal-act specification.
General Approach: Behavioral
Some names: Full Stack Alignment, Meaning Alignment Institute, Tan Zhi-Xuan, Matija Franklin, Ryan Lowe, Joe Edelman, Oliver Klingefjord
Estimated FTEs: 5
Outputs:
A theory of appropriateness with applications to generative artificial intelligence (Joel Z. Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P. Agapiou, William A. Cunningham, Peter Sunehag, Julia Haas, Raphael Koster, Edgar A. Duéñez-Guzmán, William S. Isaac, Georgios Piliouras, Stanley M. Bileschi, Iyad Rahwan, Simon Osindero)
What are human values, and how do we align AI to them? (Oliver Klingefjord, Ryan Lowe, Joe Edelman)
Model Integrity (Joe Edelman, Oliver Klingefjord)
Beyond Preferences in AI Alignment (Tan Zhi-Xuan, Micah Carroll, Matija Franklin, Hal Ashton)
Can AI Model the Complexities of Human Moral Decision-Making? A Qualitative Study of Kidney Allocation Decisions (Vijay Keswani, Vincent Conitzer, Walter Sinnott-Armstrong, Breanna K. Nguyen, Hoda Heidari, Jana Schaich Borg)
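To make the "type error" concrete, here is a minimal, purely illustrative Python sketch. It is not any of the above groups' actual formalism; the function names, the role labels, and the toy "triage" practice are all hypothetical. It contrasts values represented as a context-free utility function to be maximized with values represented as role-relative appropriateness within a concrete practice, which is the representational shape this agenda argues for.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Caricature of the "type error": values as a context-free scalar
# objective, applied identically in every situation.
UtilityFn = Callable[[str], float]

def maximize(actions: List[str], utility: UtilityFn) -> str:
    """Objective-maximizing agent: one criterion, everywhere."""
    return max(actions, key=utility)

# The alternative shape: values bound to concrete practices, where an
# agent occupies a role (participant, collaborator, advisor) and selects
# actions that are appropriate *within* that practice.
@dataclass
class Practice:
    name: str
    # role -> actions appropriate for that role in this practice
    appropriate: Dict[str, List[str]] = field(default_factory=dict)

def act_in_role(practice: Practice, role: str, candidates: List[str]) -> List[str]:
    """Keep only the candidate actions appropriate for the agent's role.

    Appropriateness is relative to the practice and role; there is no
    practice-independent score to maximize.
    """
    allowed = set(practice.appropriate.get(role, []))
    return [a for a in candidates if a in allowed]

if __name__ == "__main__":
    # Hypothetical toy practice, for illustration only.
    triage = Practice(
        name="hospital triage",
        appropriate={
            "advisor": ["summarize guidelines", "flag uncertainty"],
            "participant": ["summarize guidelines", "order tests"],
        },
    )
    candidates = ["order tests", "flag uncertainty"]
    # The same candidate set filters differently depending on role:
    print(act_in_role(triage, "advisor", candidates))      # ['flag uncertainty']
    print(act_in_role(triage, "participant", candidates))  # ['order tests']
```

The point of the sketch is the signature change: act_in_role answers "what is appropriate, given this practice and this role?", whereas maximize scores outcomes on a practice-independent scale that the agenda claims human values do not actually have.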