Other corrigibility
Diagnose and communicate the obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behavior.
Theory of Change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following does not yield robust corrigibility, and building strong tripwires and diagnostics for scheming and Goodharting, therefore reduces risk on the likely default path.
Target Case: Pessimistic
Orthodox Problems:
See Also:
Some names: Jeremy Gillen
Estimated FTEs: 1-10
Outputs:
Detect Goodhart and shut down (Jeremy Gillen)
Oblivious Defense in ML Models: Backdoor Removal without Detection (Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, Vinod Vaikuntanathan)
Cryptographic Backdoor for Neural Networks: Boon and Bane (Anh Tu Ngo, Anupam Chattopadhyay, Subhamoy Maitra)
A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning (Greg Gluch, Shafi Goldwasser)