Shallow Review of Technical AI Safety, 2025

Learning dynamics and developmental interpretability

Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context learning phases.
Theory of Change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory). By catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene as early as possible to prevent deception or misaligned values.
General Approach: Cognitive
Target Case: Worst Case
Some names: Jesse Hoogland
Estimated FTEs: 10-50
Outputs:
SLT for AI Safety (Jesse Hoogland)
Dynamics of Transient Structure in In-Context Linear Regression Transformers (Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, Daniel Murfet)
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought (Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian)
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory (Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet)
Programs as Singularities (Daniel Murfet, William Troiani)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? (Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar)
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training (Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktäschel, Jakob Foerster, Laura Ruis)
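To give a flavor of the Singular Learning Theory toolkit this agenda builds on: a central quantity is the local learning coefficient (LLC), a measure of effective model complexity whose changes during training are used to flag developmental transitions. The sketch below is my own minimal toy illustration, not the agenda's actual tooling or any library's API. It estimates the LLC of a simple quadratic loss by sampling around the minimum with SGLD, using the estimator λ̂ = nβ(E_β[L] − L(w*)); for a regular d-dimensional quadratic minimum the true value is d/2, so the estimate should land near 5 for d = 10.

```python
import numpy as np

# Toy LLC estimation sketch (assumption: a regular quadratic loss stands in
# for a neural network loss landscape; real estimators run SGLD on the
# network's training loss instead).
rng = np.random.default_rng(0)

d = 10                # parameter dimension; true LLC for this loss is d/2
n, beta = 1000, 1.0   # nominal dataset size and inverse temperature
nbeta = n * beta

def loss(w):
    # L(w) = ||w||^2 / 2, minimized at w* = 0 with L(w*) = 0
    return 0.5 * np.dot(w, w)

def grad(w):
    return w

# SGLD chain targeting the tempered posterior exp(-n*beta*L(w))
eps = 1e-4            # step size (small enough for low discretization bias)
w = np.zeros(d)
losses = []
for step in range(20000):
    w = w - 0.5 * eps * nbeta * grad(w) + np.sqrt(eps) * rng.standard_normal(d)
    if step >= 2000:  # discard burn-in before averaging
        losses.append(loss(w))

# LLC estimator: lambda_hat = n*beta * (E_beta[L] - L(w*))
llc_hat = nbeta * (np.mean(losses) - loss(np.zeros(d)))
print(f"estimated LLC = {llc_hat:.2f}, theoretical = {d / 2}")
```

Tracking an estimate like this over training checkpoints, and looking for sudden jumps, is the basic shape of the developmental-interpretability workflow; the real case requires minibatch gradients, careful step-size tuning, and localization near the current parameters.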