Debate
In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
Theory of Change: "Give humans help in supervising strong agents" + "Align explanations with the true reasoning process of the agent" + "Red team models to exhibit failure modes that don't occur in normal use" are necessary but probably not sufficient for safe AGI.
Target Case: Worst Case
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
Outputs:
UK AISI Alignment Team: Debate Sequence — Benjamin Hilton
Prover-Estimator Debate: A New Scalable Oversight Protocol — Jonah Brown-Cohen, Geoffrey Irving
AI Debate Aids Assessment of Controversial Claims — Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel
An alignment safety case sketch based on debate — Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
Ensemble Debates with Local Large Language Models for AI Alignment — Ephraiem Sarabamoun
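The basic shape of the protocol can be sketched in a few lines: two untrusted debaters alternate arguments over a claim, and a (weaker, trusted) judge decides based on the transcript. This is a minimal illustrative sketch, not any of the protocols above; `debater_pro`, `debater_con`, and `judge` are hypothetical stand-ins for model calls, implemented here as deterministic toy stubs.

```python
# Minimal sketch of a two-debater, one-judge debate round.
# The debaters and judge are toy stubs standing in for model calls.

def run_debate(claim, debater_pro, debater_con, judge, rounds=3):
    """Alternate pro/con arguments over `claim`, then ask the judge.

    Returns True if the judge sides with the pro debater.
    """
    transcript = []
    for _ in range(rounds):
        transcript.append(("pro", debater_pro(claim, transcript)))
        transcript.append(("con", debater_con(claim, transcript)))
    return judge(claim, transcript)

# Toy stubs: the pro side offers checkable evidence, the con side does not,
# illustrating the hoped-for asymmetry between true and false claims.
def debater_pro(claim, transcript):
    return f"checkable evidence for '{claim}' (step {len(transcript)})"

def debater_con(claim, transcript):
    return f"unsupported objection to '{claim}'"

def judge(claim, transcript):
    # A toy judge that simply counts checkable arguments per side.
    pro = sum("checkable" in msg for side, msg in transcript if side == "pro")
    con = sum("checkable" in msg for side, msg in transcript if side == "con")
    return pro > con

print(run_debate("the sky is blue", debater_pro, debater_con, judge))  # True
```

In real protocols the judge is the load-bearing component: the safety argument rests on honest strategies being easier to defend over many rounds, not on the judge verifying claims directly.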