Representation structure and geometry
What do learned representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we recover semantics from this geometry?
Theory of Change: Develop scalable unsupervised methods for finding structure in representations and interpreting it, then use this to, e.g., guide training (a toy sketch of such a probe follows this list).
General Approach: Cognitive
See Also:
Concept-based interpretability, Computational mechanics, Feature universality, Natural abstractions, Causal abstractions
Some names: Simplex, Insight + Interaction Lab, Paul Riechers, Adam Shai, Martin Wattenberg, Blake Richards, Mateusz Piotrowski
Estimated FTEs: 10-50
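To make the flavor of "unsupervised methods for finding structure in representations" concrete, here is a minimal sketch: collect last-token hidden states from a small open model and examine their principal components. The model choice (gpt2), the prompts, and the use of PCA are illustrative assumptions, not the method of any paper listed under Outputs.

```python
# Toy probe of representation geometry: extract final-layer hidden states
# for a few prompts and fit PCA to see how much of the variance lies in a
# low-dimensional subspace, and whether related prompts cluster.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM would do here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompts = [
    "The capital of France is",
    "The capital of Japan is",
    "The capital of Egypt is",
    "Two plus two equals",
    "Seven minus three equals",
    "The chemical symbol for gold is",
]

with torch.no_grad():
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    # hidden_states[-1]: final layer, shape (batch, seq_len, d_model)
    hidden = model(**batch).hidden_states[-1]

# Take the representation at each prompt's last real (non-pad) token.
last = batch["attention_mask"].sum(dim=1) - 1
reps = hidden[torch.arange(hidden.size(0)), last]  # (batch, d_model)

# PCA is the bluntest unsupervised structure-finder; it serves only to
# illustrate the workflow of fitting geometry to collected activations.
pca = PCA(n_components=2)
coords = pca.fit_transform(reps.numpy())
for prompt, (x, y) in zip(prompts, coords):
    print(f"{prompt!r:40s} -> ({x:+.2f}, {y:+.2f})")
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The papers below replace PCA with much richer structures (belief-state simplices, concept cones, geodesic representations), but the basic workflow of collecting activations and fitting geometric structure to them is the same.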
Outputs:
The Geometry of Self-Verification in a Task-Specific Reasoning Model — Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
Rank-1 LoRAs Encode Interpretable Reasoning Signals — Jake Ward, Paul Riechers, Adam Shai
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence — Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
Embryology of a Language Model — George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet
Constrained belief updates explain geometric structures in transformer representations — Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai
Shared Global and Local Geometry of Language Model Embeddings — Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg
Neural networks leverage nominally quantum and post-quantum representations — Paul M. Riechers, Thomas J. Elliott, Adam S. Shai
Tracing the Representation Geometry of Language Models from Pretraining to Post-training — Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards
Deep sequence models tend to memorize geometrically; it is unclear why — Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
Navigating the Latent Space Dynamics of Neural Models — Marco Fumero, Luca Moschella, Emanuele Rodolà, Francesco Locatello
The Geometry of ReLU Networks through the ReLU Transition Graph — Sahil Rajesh Dhayalkar
Connecting Neural Models Latent Geometries with Relative Geodesic Representations — Hanlin Yu, Berfin Inal, Georgios Arvanitidis, Søren Hauberg, Francesco Locatello, Marco Fumero
Next-token pretraining implies in-context learning — Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, Adam Shai