Natural abstractions

Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of concepts.

Theory of Change: Understand the concepts that structure a system's understanding and use them to inspect its "alignment/safety properties" and/or "retarget its search", i.e. identify utility-function-like components inside an AI and replace calls to them with calls to "user values" (represented using existing abstractions inside the AI).

General Approach: Cognitive

Target Case: Worst Case

Orthodox Problems:

5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake

See Also:

Causal abstractions, representational alignment, convergent abstractions, feature universality, Platonic representation hypothesis, microscope AI

Some names: John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat

Estimated FTEs: 1-10

Critiques:

Chan et al. (2023), Soto, Harwood, Soares (2023)

Outputs:

Abstract mathematical concepts vs abstractions over real

Condensation— abramdemski

The Platonic Representation Hypothesis— Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

Fernando Rosas: Identifying Abstractions (HAAISS 2025)— Fernando Rosas

Natural Latents: Latent Variables Stable Across Ontologies— John Wentworth, David Lorell

Condensation: a theory of concepts— Sam Eisenstat

Factored space models: Towards causality between levels of abstraction— Scott Garrabrant, Matthias Georg Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, Holger Dell

A single principle related to many Alignment subproblems?— Q Home

Getting aligned on representational alignment— Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Christopher J. Cueva, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nathan Cloos, Nikolaus Kriegeskorte, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O'Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, Thomas L. Griffiths

Symmetries at the origin of hierarchical emergence— Fernando E. Rosas