Weak-to-strong generalization
Use weaker models to supervise and provide a feedback signal to stronger models.
Theory of Change: Find techniques that supervise stronger models better than RLHF does → track whether these techniques fail as capabilities increase further → keep the stronger systems aligned by amplifying weak oversight and quantifying where it breaks.
General Approach: Engineering
Target Case: Average Case
Orthodox Problems:
Some names: Joshua Engels, Nora Belrose, David D. Baek
Estimated FTEs: 2-20
Outputs:
Scaling Laws For Scalable Oversight - Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
Great Models Think Alike and this Undermines AI Oversight - Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
Debate Helps Weak-to-Strong Generalization - Hao Lang, Fei Huang, Yongbin Li
Understanding the Capabilities and Limitations of Weak-to-Strong Generalization - Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu