Asymptotic guarantees
Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification is guaranteed to hold. Use complexity theory, game theory, learning theory and other areas both to improve asymptotic guarantees and to develop ways of showing convergence.
Theory of Change: Formal verification may be too hard. Make safety cases stronger by modelling their processes and proving that they would work in the limit.
General Approach: Cognitive
Target Case: Pessimistic
See Also:
Some names: AISI, Jacob Pfau, Benjamin Hilton, Geoffrey Irving, Simon Marshall, Will Kirby, Martin Soto, David Africa
Estimated FTEs: 5-10
Critiques:
Self-critique in UK AISI's Alignment Team: Research Agenda
Outputs:
An alignment safety case sketch based on debate— Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
UK AISI's Alignment Team: Research Agenda— Benjamin Hilton, Jacob Pfau, Marie_DB, Geoffrey Irving
Dodging systematic human errors in scalable oversight— Geoffrey Irving