Learning dynamics and developmental interpretability
Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (such as grokking or deception) that arise over the course of a model's training and during in-context learning.
Theory of Change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory, such as learning coefficients; a toy sketch of estimating one such trace appears after the outputs list below). By catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene to prevent deceptiveness or misaligned values as early as possible.
General Approach: Cognitive
Target Case: Worst Case
Orthodox Problems:
See Also:
Some names: Jesse Hoogland
Estimated FTEs: 10-50
Critiques:
Outputs:
Understanding and Controlling LLM Generalization — Daniel Tan
SLT for AI Safety — Jesse Hoogland
Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining — Deniz Bayazit, Aaron Mueller, Antoine Bosselut
A Review of Developmental Interpretability in Large Language Models — Ihor Kendiukhov
Dynamics of Transient Structure in In-Context Linear Regression Transformers — Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, Daniel Murfet
Learning Coefficients, Fractals, and Trees in Parameter Space — Max Hennick, Matthias Dellago
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought — Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory — Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet
Programs as Singularities — Daniel Murfet, William Troiani
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? — Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar
Modes of Sequence Models and Learning Coefficients — Zhongtian Chen, Daniel Murfet
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training — Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis
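To make the "identifiable traces" in the theory of change concrete: one workhorse quantity in this agenda is the local learning coefficient (LLC) from Singular Learning Theory, commonly estimated by running SGLD on a localized, tempered posterior around the trained parameters (the approach behind the devinterp toolkit and several of the papers above). The following is a minimal, illustrative PyTorch sketch of that style of estimator, not any paper's reference implementation; the toy model and the hyperparameters (lr, gamma, step counts) are placeholder assumptions, not tuned values.

```python
import copy
import math
import torch
import torch.nn as nn

def estimate_llc(model, loss_fn, xs, ys, n_steps=2000, lr=1e-5, gamma=100.0):
    # Sketch of an SGLD-based local learning coefficient estimate at the
    # trained parameters w*:
    #   llc_hat = n * beta * (E_sgld[L_n(w)] - L_n(w*)),  beta = 1 / log n.
    # All hyperparameters here are illustrative, not tuned.
    n = xs.shape[0]
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)

    with torch.no_grad():
        base_loss = loss_fn(model(xs), ys).item()  # L_n(w*)

    chain_losses = []
    for _ in range(n_steps):
        loss = loss_fn(sampler(xs), ys)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # Drift of the localized, tempered log-posterior plus
                # Gaussian SGLD noise; gamma keeps the chain near w*.
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        chain_losses.append(loss.item())

    burn_in = n_steps // 2
    mean_loss = sum(chain_losses[burn_in:]) / (n_steps - burn_in)
    return n * beta * (mean_loss - base_loss)

# Toy usage: fit a small regression model, then estimate its LLC.
torch.manual_seed(0)
xs = torch.randn(512, 4)
ys = xs @ torch.randn(4, 1)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    nn.functional.mse_loss(model(xs), ys).backward()
    opt.step()

print("estimated LLC:", estimate_llc(model, nn.functional.mse_loss, xs, ys))
```

Tracked across training checkpoints, jumps and plateaus in an estimate like this are the kind of developmental signature the agenda uses to locate phase transitions.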