Shallow Review of Technical AI Safety, 2025

Natural abstractions

Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of concepts.
Theory of Change: Understand the concepts that structure a system's understanding and use them to inspect its "alignment/safety properties" and/or "retarget its search", i.e. identify utility-function-like components inside an AI and replace calls to them with calls to "user values" (represented using the AI's existing abstractions).
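A toy sketch of the "retarget the search" idea, assuming a system whose planning loop is cleanly parameterized by a value-estimating component. All names below are hypothetical illustrations, not any real system's internals:

```python
from typing import Callable, Iterable

def greedy_search(candidates: Iterable[str], value_fn: Callable[[str], float]) -> str:
    """Stand-in for the system's internal search: pick the plan value_fn scores highest."""
    return max(candidates, key=value_fn)

def learned_value_estimate(plan: str) -> float:
    """Stand-in for a utility-function-like component located inside the AI."""
    return float(plan.count("factory"))  # whatever the system happened to learn to optimize

def user_values(plan: str) -> float:
    """Stand-in for 'user values', expressed over the AI's existing abstractions (here, plans)."""
    return float(plan.count("trees"))

plans = ["build factory", "plant trees", "do nothing"]
print(greedy_search(plans, learned_value_estimate))  # what the unmodified system picks
print(greedy_search(plans, user_values))             # the same search machinery, retargeted
```

The point of the sketch is only that the search and the thing it optimizes are separable components; the hard open problems are finding such a decomposition inside a trained network and expressing "user values" in its abstractions.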
General Approach: Cognitive
Target Case: Worst Case
See Also:
Causal Abstractions, representational alignment, convergent abstractions, feature universality, Platonic representation hypothesis, microscope AI
Some names: John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat
Estimated FTEs: 1-10
Outputs:
Condensation (Abram Demski)
The Platonic Representation Hypothesis (Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola)
Factored space models: Towards causality between levels of abstraction (Scott Garrabrant, Matthias Georg Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, Holger Dell)
Getting aligned on representational alignment (Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Christopher J. Cueva, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nathan Cloos, Nikolaus Kriegeskorte, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O'Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, Thomas L. Griffiths)
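To give "representational alignment" and "mutual translatability between different collections of concepts" one concrete handle: a standard way to compare two models' representations of the same inputs is linear centered kernel alignment (CKA). This is a minimal sketch, not the metric used in the Platonic Representation Hypothesis paper (which measures convergence via a mutual nearest-neighbor score); the array shapes and random data are illustrative assumptions only.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2) on the same n inputs.
    Values near 1 mean the representations share linear structure; near 0, little is shared."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")))

# Illustrative only: fake "activations" from two hypothetical models on 512 shared inputs.
rng = np.random.default_rng(0)
acts_model_a = rng.normal(size=(512, 768))
acts_model_b = acts_model_a @ rng.normal(size=(768, 1024))  # a linear map of A: high CKA expected
print(round(linear_cka(acts_model_a, acts_model_b), 3))
print(round(linear_cka(acts_model_a, rng.normal(size=(512, 1024))), 3))  # unrelated: low CKA
```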