Shallow Review of Technical AI Safety, 2025

Debate

In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
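That asymmetry can be made concrete with a toy sketch of the debate protocol (a hypothetical illustration, not any paper's implementation; all function names here are made up): two untrusted debaters defend opposing answers, and a weaker judge, who can verify arguments but not derive answers, picks the side whose argument checks out.

```python
# Toy debate protocol sketch (hypothetical; run_debate, honest_debater,
# dishonest_debater, and verifying_judge are illustrative names, not a real API).
from typing import Callable

Debater = Callable[[str, str], str]  # (question, answer defended) -> argument

def run_debate(question: str, answer_a: str, answer_b: str,
               debater_a: Debater, debater_b: Debater,
               judge: Callable[[str, str, str], int]) -> str:
    """Return the answer whose argument the judge scores higher."""
    arg_a = debater_a(question, answer_a)
    arg_b = debater_b(question, answer_b)
    # Judge returns 0 if argument A wins, 1 if argument B wins.
    return answer_a if judge(question, arg_a, arg_b) == 0 else answer_b

# Toy instantiation of the asymmetry: an argument for a true claim can carry
# a checkable witness; a false claim cannot, so the verifying judge prefers
# the honest side even though it could not answer the question unaided.
def honest_debater(question: str, answer: str) -> str:
    return f"{answer} [witness: 7 * 13 = 91, checkable]"

def dishonest_debater(question: str, answer: str) -> str:
    return f"{answer} [no checkable witness]"

def verifying_judge(question: str, arg_a: str, arg_b: str) -> int:
    return 0 if "checkable" in arg_a else 1

winner = run_debate("Is 91 prime?", "No: 91 = 7 * 13", "Yes",
                    honest_debater, dishonest_debater, verifying_judge)
```

The judge here is deliberately dumb; the load-bearing assumption (as in the safety-case work below) is that verification is easier than generation.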
Theory of Change: "Give humans help in supervising strong agents" + "Align explanations with the true reasoning process of the agent" + "Red team models to exhibit failure modes that don't occur in normal use" are necessary but probably not sufficient for safe AGI.
Target Case: Worst Case
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
Outputs:
AI Debate Aids Assessment of Controversial Claims
Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel
An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving