Shallow Review of Technical AI Safety, 2025

Supervising AIs improving AIs

Build formal and empirical frameworks in which AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools that enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
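As one concrete illustration of the monitoring side of this agenda, the sketch below shows a minimal behavioural-drift check: given a fixed set of probe prompts, it compares the output distributions of an earlier, trusted model generation against a later one and flags probes where the divergence exceeds a threshold. The probe set, the KL-based metric, and the threshold are illustrative assumptions, not a method taken from any of the papers listed under Outputs.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def drift_report(reference_dists, candidate_dists, threshold=0.1):
    """Flag probes where a later model's output distribution has drifted from
    an earlier (trusted) model's distribution on the same probe.

    reference_dists / candidate_dists: dict mapping probe id -> probability
    distribution over a fixed answer set (e.g. per-option probabilities on
    multiple-choice probes). The probe set and threshold are illustrative.
    """
    report = {}
    for probe_id, ref in reference_dists.items():
        score = kl_divergence(ref, candidate_dists[probe_id])
        report[probe_id] = {"kl": score, "drifted": score > threshold}
    return report

if __name__ == "__main__":
    # Toy example: two probes with per-option answer probabilities.
    reference = {"probe_0": [0.7, 0.2, 0.1], "probe_1": [0.5, 0.5, 0.0]}
    candidate = {"probe_0": [0.65, 0.25, 0.1], "probe_1": [0.1, 0.1, 0.8]}
    for probe, result in drift_report(reference, candidate).items():
        print(probe, result)
```

In a real pipeline the distributions would come from model queries over a much larger probe set, but the shape of the check (fixed probes, pairwise divergence, flagging rule) is the same.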
Theory of Change: Early models train ~only on human data, while later models also train on early-model outputs, so problems in early models can cascade into later ones. Left unchecked, this will likely cause problems, so supervision mechanisms are needed to help ensure that AI self-improvement remains legible.
General Approach: Behavioral
Target Case: Pessimistic
Some names: Akbir Khan, Ethan Perez
Estimated FTEs: 1-10
Outputs:
Bare Minimum Mitigations for Autonomous AI Development (Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan)
Scaling Laws For Scalable Oversight (Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark)
Neural Interactive Proofs (Lewis Hammond, Sam Adam-Day)
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing (Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu)
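To make the "structured interactions" framing concrete, here is a minimal sketch of a recursive self-critique loop in the spirit of the last output above: a worker model answers, a weaker overseer critiques the answer and then its own critique up to a fixed depth, and a simple judge accepts or rejects the answer. The worker, overseer, and judge stubs and the accept rule are illustrative assumptions; this is not the protocol from the cited paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interfaces: each is just a text -> text function here.
# In practice these would wrap calls to a stronger "worker" model and a
# weaker, trusted "overseer" model; the stubs below only illustrate the
# control flow of a recursive critique loop.
Model = Callable[[str], str]

@dataclass
class CritiqueNode:
    depth: int
    text: str
    children: List["CritiqueNode"]

def recursive_critique(task: str, answer: str, overseer: Model,
                       max_depth: int = 2) -> CritiqueNode:
    """Build a critique tree: the overseer critiques the answer, then
    critiques its own critique, and so on up to max_depth."""
    def critique(target: str, depth: int) -> CritiqueNode:
        text = overseer(f"Task: {task}\nCritique this:\n{target}")
        children = []
        if depth < max_depth:
            children.append(critique(text, depth + 1))
        return CritiqueNode(depth=depth, text=text, children=children)
    return critique(answer, depth=1)

def oversee(task: str, worker: Model, overseer: Model,
            judge: Callable[[CritiqueNode], bool]):
    """Have the worker answer, build the critique tree, and let a simple
    judge decide whether to accept the answer based on the critiques."""
    answer = worker(task)
    tree = recursive_critique(task, answer, overseer)
    return answer, judge(tree)

if __name__ == "__main__":
    # Toy stand-ins for the models (hypothetical, for illustration only).
    worker = lambda task: "2 + 2 = 5"
    overseer = lambda prompt: ("FLAW: arithmetic error" if "5" in prompt
                               else "no flaw found")

    # Accept only if no critique anywhere in the tree flags a flaw.
    def judge(node: CritiqueNode) -> bool:
        if node.text.startswith("FLAW"):
            return False
        return all(judge(child) for child in node.children)

    answer, accepted = oversee("Compute 2 + 2.", worker, overseer, judge)
    print(answer, "accepted" if accepted else "rejected")
```

The recursion depth is the main knob: deeper critique-of-critique trees cost more overseer queries but give the judge more signal about whether the first-level critique itself can be trusted.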