Shallow Review of Technical AI Safety, 2025

Representation structure and geometry

What do a model's internal representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we recover semantics from this geometry?
Theory of Change: Develop scalable unsupervised methods for finding structure in representations and interpreting it, then use this to e.g. guide training (see the sketch at the end of this section).
General Approach: Cognitive
See Also: Concept-based interpretability, computational mechanics, feature universality, natural abstractions, causal abstractions
Some names: Simplex, Insight + Interaction Lab, Paul Riechers, Adam Shai, Martin Wattenberg, Blake Richards, Mateusz Piotrowski
Estimated FTEs: 10–50
Outputs:
The Geometry of Self-Verification in a Task-Specific Reasoning Model (Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg)
Rank-1 LoRAs Encode Interpretable Reasoning Signals (Jake Ward, Paul Riechers, Adam Shai)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence (Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger)
Embryology of a Language Model (George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet)
Constrained belief updates explain geometric structures in transformer representations (Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai)
Shared Global and Local Geometry of Language Model Embeddings (Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg)
Neural networks leverage nominally quantum and post-quantum representations (Paul M. Riechers, Thomas J. Elliott, Adam S. Shai)
Tracing the Representation Geometry of Language Models from Pretraining to Post-training (Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards)
Deep sequence models tend to memorize geometrically; it is unclear why (Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar)
Navigating the Latent Space Dynamics of Neural Models (Marco Fumero, Luca Moschella, Emanuele Rodolà, Francesco Locatello)
Connecting Neural Models Latent Geometries with Relative Geodesic Representations (Hanlin Yu, Berfin Inal, Georgios Arvanitidis, Soren Hauberg, Francesco Locatello, Marco Fumero)
Next-token pretraining implies in-context learning (Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, Adam Shai)
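
To make the "unsupervised structure-finding" flavour of this agenda concrete, here is a minimal sketch that pulls hidden states from a small open model and checks how much of their variance a few principal directions capture. The model (GPT-2), the layer index, the prompts, and the mean-pooling choice are all placeholder assumptions for illustration, not the method of any paper listed above.

```python
# Minimal sketch: unsupervised geometry probing of LM hidden states.
# Assumptions: GPT-2 as a stand-in model; layer 6 and the prompts are arbitrary.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompts = [
    "Paris is the capital of France.",
    "Two plus two equals four.",
    "The mitochondria is the powerhouse of the cell.",
    "Water boils at one hundred degrees Celsius.",
    "Shakespeare wrote Hamlet.",
]

acts = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids)
        h = out.hidden_states[6][0]   # layer-6 activations, shape (seq, d_model)
        acts.append(h.mean(dim=0))    # mean-pool over tokens -> one vector per prompt

# Center the activation matrix and inspect its singular-value spectrum:
# if a handful of directions explain most variance, that hints at the
# low-dimensional linear structure this research area studies.
X = torch.stack(acts).numpy()
X = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = (S**2) / (S**2).sum()
print("variance explained by top components:", explained[:4].round(3))
```

A dominant leading component here is only weak evidence of structure; the papers above use far larger prompt sets, careful controls, and task-specific geometry (cones, belief simplices, geodesics) rather than a raw SVD.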