Natural abstractions
Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of concepts.
Theory of Change: Understand the concepts that structure a system's understanding and use them to inspect its "alignment/safety properties" and/or "retarget its search", i.e. identify utility-function-like components inside an AI and replace calls to them with calls to "user values" (represented using existing abstractions inside the AI).
General Approach: Cognitive
Target Case: Worst Case
See Also: causal abstractions, representational alignment, convergent abstractions, feature universality, Platonic representation hypothesis, microscope AI
Some names: John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat
Estimated FTEs: 1-10
Critiques:
Outputs:
Condensation— abramdemski
The Platonic Representation Hypothesis— Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola
Fernando Rosas: Identifying Abstractions (HAAISS 2025)— Fernando Rosas
Natural Latents: Latent Variables Stable Across Ontologies— John Wentworth, David Lorell
Condensation: a theory of concepts— Sam Eisenstat
Factored space models: Towards causality between levels of abstraction— Scott Garrabrant, Matthias Georg Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, Holger Dell
Getting aligned on representational alignment— Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Christopher J. Cueva, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nathan Cloos, Nikolaus Kriegeskorte, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O'Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, Thomas L. Griffiths
Symmetries at the origin of hierarchical emergence— Fernando E. Rosas