Shallow Review of Technical AI Safety, 2025

Debate

In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
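That asymmetry can be made concrete with a toy sketch of the debate protocol (a hypothetical illustration, not any paper's implementation; all function names here are made up): two untrusted debaters defend opposing answers, and a weaker judge, who can verify arguments but not derive answers, picks the side whose argument checks out.

```python
# Toy debate protocol sketch (hypothetical; run_debate, honest_debater,
# dishonest_debater, and verifying_judge are illustrative names, not a real API).
from typing import Callable

Debater = Callable[[str, str], str]  # (question, answer defended) -> argument

def run_debate(question: str, answer_a: str, answer_b: str,
               debater_a: Debater, debater_b: Debater,
               judge: Callable[[str, str, str], int]) -> str:
    """Return the answer whose argument the judge scores higher."""
    arg_a = debater_a(question, answer_a)
    arg_b = debater_b(question, answer_b)
    # Judge returns 0 if argument A wins, 1 if argument B wins.
    return answer_a if judge(question, arg_a, arg_b) == 0 else answer_b

# Toy instantiation of the asymmetry: an argument for a true claim can carry
# a checkable witness; a false claim cannot, so the verifying judge prefers
# the honest side even though it could not answer the question unaided.
def honest_debater(question: str, answer: str) -> str:
    return f"{answer} [witness: 7 * 13 = 91, checkable]"

def dishonest_debater(question: str, answer: str) -> str:
    return f"{answer} [no checkable witness]"

def verifying_judge(question: str, arg_a: str, arg_b: str) -> int:
    return 0 if "checkable" in arg_a else 1

winner = run_debate("Is 91 prime?", "No: 91 = 7 * 13", "Yes",
                    honest_debater, dishonest_debater, verifying_judge)
```

The judge here is deliberately dumb; the load-bearing assumption (as in the safety-case work below) is that verification is easier than generation.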
Theory of Change: "Give humans help in supervising strong agents" + "Align explanations with the true reasoning process of the agent" + "Red team models to exhibit failure modes that don't occur in normal use" are necessary but probably not sufficient for safe AGI.
Target Case: Worst Case
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
Outputs:
AI Debate Aids Assessment of Controversial Claims
Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel
An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving