Learning dynamics and developmental interpretability
Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (such as grokking or deception) that arise over the course of a model's training and during in-context learning.
Theory of Change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory, such as learning coefficients; a toy sketch of estimating one such trace appears after the outputs list below). By catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene to prevent deceptiveness or misaligned values as early as possible.
General Approach: Cognitive
Target Case: Worst Case
Orthodox Problems:
See Also:
Some names: Jesse Hoogland
Estimated FTEs: 10-50
Critiques:
Outputs:
Understanding and Controlling LLM Generalization — Daniel Tan
SLT for AI Safety — Jesse Hoogland
Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining — Deniz Bayazit, Aaron Mueller, Antoine Bosselut
A Review of Developmental Interpretability in Large Language Models — Ihor Kendiukhov
Dynamics of Transient Structure in In-Context Linear Regression Transformers — Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, Daniel Murfet
Learning Coefficients, Fractals, and Trees in Parameter Space — Max Hennick, Matthias Dellago
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought — Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory — Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet
Programs as Singularities — Daniel Murfet, William Troiani
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? — Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar
Modes of Sequence Models and Learning Coefficients — Zhongtian Chen, Daniel Murfet
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training — Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis
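To make the "identifiable traces" in the theory of change concrete: one workhorse quantity in this agenda is the local learning coefficient (LLC) from Singular Learning Theory, commonly estimated by running SGLD on a localized, tempered posterior around the trained parameters (the approach behind the devinterp toolkit and several of the papers above). The following is a minimal, illustrative PyTorch sketch of that style of estimator, not any paper's reference implementation; the toy model and the hyperparameters (lr, gamma, step counts) are placeholder assumptions, not tuned values.

```python
import copy
import math
import torch
import torch.nn as nn

def estimate_llc(model, loss_fn, xs, ys, n_steps=2000, lr=1e-5, gamma=100.0):
    # Sketch of an SGLD-based local learning coefficient estimate at the
    # trained parameters w*:
    #   llc_hat = n * beta * (E_sgld[L_n(w)] - L_n(w*)),  beta = 1 / log n.
    # All hyperparameters here are illustrative, not tuned.
    n = xs.shape[0]
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)

    with torch.no_grad():
        base_loss = loss_fn(model(xs), ys).item()  # L_n(w*)

    chain_losses = []
    for _ in range(n_steps):
        loss = loss_fn(sampler(xs), ys)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # Drift of the localized, tempered log-posterior plus
                # Gaussian SGLD noise; gamma keeps the chain near w*.
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        chain_losses.append(loss.item())

    burn_in = n_steps // 2
    mean_loss = sum(chain_losses[burn_in:]) / (n_steps - burn_in)
    return n * beta * (mean_loss - base_loss)

# Toy usage: fit a small regression model, then estimate its LLC.
torch.manual_seed(0)
xs = torch.randn(512, 4)
ys = xs @ torch.randn(4, 1)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    nn.functional.mse_loss(model(xs), ys).backward()
    opt.step()

print("estimated LLC:", estimate_llc(model, nn.functional.mse_loss, xs, ys))
```

Tracked across training checkpoints, jumps and plateaus in an estimate like this are the kind of developmental signature the agenda uses to locate phase transitions.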