Supervising AIs improving AIs
Build formal and empirical frameworks in which AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools that enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
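As a concrete illustration of what a "structured interaction" could look like, here is a minimal sketch in which a weaker verifier model cross-examines a stronger prover model before accepting its answer. This is an illustrative protocol shape only, not a protocol taken from the papers listed below; the `Model` alias, the `supervise` function, and the round count are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # text-in/text-out stand-in for an LLM API

def supervise(prover: Model, verifier: Model, task: str, rounds: int = 3) -> bool:
    """A weak verifier interrogates a stronger prover over several rounds,
    then issues an accept/reject verdict on the prover's answer.
    (Illustrative protocol shape only, not from the cited papers.)"""
    answer = prover(f"Task: {task}\nGive your answer with reasoning.")
    transcript = f"Task: {task}\nProver answer: {answer}"
    for _ in range(rounds):
        question = verifier(f"{transcript}\nAsk one probing question:")
        reply = prover(f"{transcript}\nVerifier asks: {question}\nReply:")
        transcript += f"\nQ: {question}\nA: {reply}"
    verdict = verifier(f"{transcript}\nAccept the answer? Reply ACCEPT or REJECT.")
    return verdict.strip().upper().startswith("ACCEPT")
```

The hope, studied formally in several of the outputs below, is that interaction lets a weak supervisor check claims it could not have produced itself; the papers also characterise when such protocols fail.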
Theory of Change: Early models train almost exclusively on human data, while later models also train on early-model outputs, so problems in the early models can cascade into their successors. Left unchecked, this cascade is likely to cause problems, so supervision mechanisms are needed to help ensure that AI self-improvement remains legible.
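One way to make the cascade concern measurable is to track behavioural drift between successive model generations on a fixed probe set. The sketch below is a hypothetical monitor, not a tool from this agenda: exact-match disagreement stands in for the semantic or distributional distance a real system would use, and the review threshold is illustrative.

```python
from typing import Callable

Model = Callable[[str], str]  # text-in/text-out stand-in for an LLM API

def drift_rate(old: Model, new: Model, probes: list[str]) -> float:
    """Fraction of probe prompts on which the new generation's answer
    diverges from the old one's. Exact match is a crude proxy; a real
    monitor would use semantic similarity or a statistical distance
    between output distributions."""
    changed = sum(old(p).strip() != new(p).strip() for p in probes)
    return changed / len(probes)

def needs_review(old: Model, new: Model,
                 probes: list[str], budget: float = 0.15) -> bool:
    """Flag a generation for human review when drift on the probe set
    exceeds a budget (the 0.15 default is hypothetical)."""
    return drift_rate(old, new, probes) > budget
```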
General Approach: Behavioral
Target Case: Pessimistic
Some names: Akbir Khan, Ethan Perez
Estimated FTEs: 1-10
Outputs:
Bare Minimum Mitigations for Autonomous AI Development — Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan
Dodging systematic human errors in scalable oversight — Geoffrey Irving
Scaling Laws For Scalable Oversight — Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
Neural Interactive Proofs — Lewis Hammond, Sam Adam-Day
Modeling Human Beliefs about AI Behavior for Scalable Oversight — Leon Lang, Patrick Forré
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing — Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu