Guaranteed-Safe AI
Have an AI system generate outputs (e.g. code, control systems, or RL policies) that come with quantitative guarantees of compliance with a formal safety specification, relative to a formal world model.
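A toy illustration of what such a check might look like (a hedged sketch, not drawn from any of the projects listed below): the Z3 SMT solver is asked for a counterexample to a one-step invariance spec under a simple world model. The dynamics, policy, and bounds here are all hypothetical placeholders.

```python
# Minimal sketch: verify that an AI-proposed one-step controller respects a
# formal safety spec under a toy world model, using the Z3 SMT solver.
# The world model, policy, and bounds below are hypothetical.
from z3 import Real, Solver, And, Not, unsat

x, u = Real("x"), Real("u")

# Hypothetical world model: next state after applying control u.
x_next = x + u

# Hypothetical AI-proposed policy: u = -0.5 * x.
policy = (u == -0.5 * x)

# Formal safety specification: if the current state lies in [-1, 1],
# the next state must also lie in [-1, 1] (one-step invariance).
in_bounds = And(x >= -1, x <= 1)
spec = And(x_next >= -1, x_next <= 1)

# Ask for a counterexample: a state and control consistent with the policy
# that violate the spec. 'unsat' means none exists, i.e. the policy is
# provably safe with respect to this spec and world model.
s = Solver()
s.add(in_bounds, policy, Not(spec))
if s.check() == unsat:
    print("Policy verified: no counterexample to the safety spec.")
else:
    print("Unsafe: counterexample found:", s.model())
```

A real guaranteed-safe pipeline would scale this pattern up: richer world models, probabilistic or quantitative specs, and proof certificates that can be checked independently of the system that produced them.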
Theory of Change: Various, including:
i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;
ii) secure containers: create a 'gatekeeper' system that acts as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through (see the sketch below).
(Notable for not requiring that we solve ELK, though it does require that we solve the ontology problem.)
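A minimal sketch of the gatekeeper idea in (ii), assuming a fail-closed design: the `Gatekeeper` class, its verifier, and the fallback action are hypothetical placeholders; a real system would rely on a machine-checked certificate rather than a Python predicate.

```python
# Illustrative sketch of a 'gatekeeper': an intermediary that only forwards
# actions from an untrusted policy when an independent verifier certifies
# them against the safety spec, otherwise substituting a known-safe fallback.
# All names, action formats, and the bound below are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Gatekeeper:
    verify: Callable[[dict, dict], bool]  # (state, action) -> certified safe?
    fallback: dict                        # known-safe default action

    def filter(self, state: dict, proposed: dict) -> dict:
        # Only provably safe actions pass; anything unverifiable is replaced
        # by the conservative fallback (fail-closed behaviour).
        return proposed if self.verify(state, proposed) else self.fallback

# Hypothetical usage: the untrusted policy proposes a thrust command and the
# verifier enforces a hard bound required by the spec.
gk = Gatekeeper(
    verify=lambda state, action: abs(action["thrust"]) <= 1.0,
    fallback={"thrust": 0.0},
)
print(gk.filter({"alt": 10.0}, {"thrust": 3.0}))  # -> {'thrust': 0.0}
```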
Target Case: Worst Case
See Also:
Towards Guaranteed Safe AI, Standalone World-Models, Scientist AI, Safeguarded AI, Asymptotic guarantees, Open Agency Architecture, SLES, program synthesis, Scalable formal oversight
Some names: ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Steve Omohundro, Joar Skalse, Stuart Russell, Alessandro Abate
Estimated FTEs: 10-100
Critiques:
Outputs:
SafePlanBench: evaluating a Guaranteed Safe AI Approach for LLM-based Agents — Agustín Martinez Suñé, Tan Zhi Xuan
Beliefs about formal methods and AI safety — Quinn Dougherty
Report on NSF Workshop on Science of Safe AI — Rajeev Alur, Greg Durrett, Hadas Kress-Gazit, Corina Păsăreanu, René Vidal
A benchmark for vericoding: formally verified program synthesis — Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, Lacramioara Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, Miranda Zhao, Max Tan, Max Tegmark