Shallow Review of Technical AI Safety, 2025

Weak-to-strong generalization

Use weaker models to supervise and provide a feedback signal to stronger models.
Theory of Change: Find techniques that do better than RLHF at supervising superior models → track whether these techniques fail as capabilities increase further → keep the stronger systems aligned by amplifying weak oversight and quantifying where it breaks.
General Approach: Engineering
Target Case: Average Case
Some names: Joshua Engels, Nora Belrose, David D. Baek
Estimated FTEs: 2-20
Outputs:
Scaling Laws For Scalable Oversight (Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark)
Great Models Think Alike and this Undermines AI Oversight (Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping)
Debate Helps Weak-to-Strong Generalization (Hao Lang, Fei Huang, Yongbin Li)
Understanding the Capabilities and Limitations of Weak-to-Strong Generalization (Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu)
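
To make "quantifying where it breaks" in the theory of change concrete, here is a minimal sketch of one commonly used measure, the performance gap recovered (PGR) introduced in the original weak-to-strong generalization paper (Burns et al., 2023). It is not taken from the outputs listed above, which may use different measures. The function name and the accuracy values are hypothetical and chosen only for illustration.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak supervision.

    weak_acc:            accuracy of the weak supervisor on its own
    weak_to_strong_acc:  accuracy of the strong model trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        # PGR is undefined when the strong ceiling does not beat the weak model.
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies, not results from any paper.
pgr = performance_gap_recovered(weak_acc=0.60,
                                weak_to_strong_acc=0.72,
                                strong_ceiling_acc=0.80)
print(f"PGR = {pgr:.2f}")
```

A PGR near 1 means the strong model reaches almost its ground-truth-trained ceiling despite weak labels; a PGR near 0 means it does little better than its supervisor. Tracking this value as the capability gap between supervisor and student grows is one way to see whether a technique degrades.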