Shallow Review of Technical AI Safety, 2025

Supervising AIs improving AIs

Build formal and empirical frameworks in which AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools that enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
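As one concrete illustration of the monitoring side of this agenda, the sketch below shows a minimal behavioural-drift check: given a fixed set of probe prompts, it compares the output distributions of an earlier, trusted model generation against a later one and flags probes where the divergence exceeds a threshold. The probe set, the KL-based metric, and the threshold are illustrative assumptions, not a method taken from any of the papers listed under Outputs.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def drift_report(reference_dists, candidate_dists, threshold=0.1):
    """Flag probes where a later model's output distribution has drifted from
    an earlier (trusted) model's distribution on the same probe.

    reference_dists / candidate_dists: dict mapping probe id -> probability
    distribution over a fixed answer set (e.g. per-option probabilities on
    multiple-choice probes). The probe set and threshold are illustrative.
    """
    report = {}
    for probe_id, ref in reference_dists.items():
        score = kl_divergence(ref, candidate_dists[probe_id])
        report[probe_id] = {"kl": score, "drifted": score > threshold}
    return report

if __name__ == "__main__":
    # Toy example: two probes with per-option answer probabilities.
    reference = {"probe_0": [0.7, 0.2, 0.1], "probe_1": [0.5, 0.5, 0.0]}
    candidate = {"probe_0": [0.65, 0.25, 0.1], "probe_1": [0.1, 0.1, 0.8]}
    for probe, result in drift_report(reference, candidate).items():
        print(probe, result)
```

In a real pipeline the distributions would come from model queries over a much larger probe set, but the shape of the check (fixed probes, pairwise divergence, flagging rule) is the same.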
Theory of Change: Early models train ~only on human data, while later models also train on early-model outputs, so problems in early models can cascade into later ones. Left unchecked, this will likely cause problems, so supervision mechanisms are needed to help ensure that AI self-improvement remains legible.
General Approach: Behavioral
Target Case: Pessimistic
Some names: Akbir Khan, Ethan Perez
Estimated FTEs: 1-10
Outputs:
Bare Minimum Mitigations for Autonomous AI Development (Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan)
Scaling Laws For Scalable Oversight (Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark)
Neural Interactive Proofs (Lewis Hammond, Sam Adam-Day)
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing (Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu)
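To make the "structured interactions" framing concrete, here is a minimal sketch of a recursive self-critique loop in the spirit of the last output above: a worker model answers, a weaker overseer critiques the answer and then its own critique up to a fixed depth, and a simple judge accepts or rejects the answer. The worker, overseer, and judge stubs and the accept rule are illustrative assumptions; this is not the protocol from the cited paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interfaces: each is just a text -> text function here.
# In practice these would wrap calls to a stronger "worker" model and a
# weaker, trusted "overseer" model; the stubs below only illustrate the
# control flow of a recursive critique loop.
Model = Callable[[str], str]

@dataclass
class CritiqueNode:
    depth: int
    text: str
    children: List["CritiqueNode"]

def recursive_critique(task: str, answer: str, overseer: Model,
                       max_depth: int = 2) -> CritiqueNode:
    """Build a critique tree: the overseer critiques the answer, then
    critiques its own critique, and so on up to max_depth."""
    def critique(target: str, depth: int) -> CritiqueNode:
        text = overseer(f"Task: {task}\nCritique this:\n{target}")
        children = []
        if depth < max_depth:
            children.append(critique(text, depth + 1))
        return CritiqueNode(depth=depth, text=text, children=children)
    return critique(answer, depth=1)

def oversee(task: str, worker: Model, overseer: Model,
            judge: Callable[[CritiqueNode], bool]):
    """Have the worker answer, build the critique tree, and let a simple
    judge decide whether to accept the answer based on the critiques."""
    answer = worker(task)
    tree = recursive_critique(task, answer, overseer)
    return answer, judge(tree)

if __name__ == "__main__":
    # Toy stand-ins for the models (hypothetical, for illustration only).
    worker = lambda task: "2 + 2 = 5"
    overseer = lambda prompt: ("FLAW: arithmetic error" if "5" in prompt
                               else "no flaw found")

    # Accept only if no critique anywhere in the tree flags a flaw.
    def judge(node: CritiqueNode) -> bool:
        if node.text.startswith("FLAW"):
            return False
        return all(judge(child) for child in node.children)

    answer, accepted = oversee("Compute 2 + 2.", worker, overseer, judge)
    print(answer, "accepted" if accepted else "rejected")
```

The recursion depth is the main knob: deeper critique-of-critique trees cost more overseer queries but give the judge more signal about whether the first-level critique itself can be trusted.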