Mild optimisation
Avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of Change: If we fail to exactly nail down the preferences of a superintelligent agent, we die to Goodharting → shift from maximising to satisficing in the agent's utility function → we get a nonzero share of the lightcone rather than zero; as a moonshot, this could also be the recipe for fully aligned AI. (A minimal sketch of the maximiser/satisficer distinction follows the outputs list below.)
General Approach: Cognitive
Orthodox Problems:
Estimated FTEs: 10-50
Outputs:
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking — Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format — Roland Pihlakas, Sruthi Kuriakose
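To make the maximising-vs-satisficing distinction concrete, here is a minimal Python sketch. It is not drawn from any of the outputs above; the action names, utility numbers, and threshold are illustrative assumptions. A maximiser always takes the argmax of its utility estimates, so any error in those estimates is exploited at its most extreme point (Goodharting), while a satisficer accepts any action whose estimate clears a "good enough" threshold.

```python
import random

# Hypothetical utility estimates for a handful of candidate actions.
# In a real agent these would come from a learned value model; the
# numbers here are placeholders for illustration only.
candidate_utilities = {
    "cautious_plan": 0.70,
    "balanced_plan": 0.85,
    "aggressive_plan": 0.99,  # highest estimate, most exposed to Goodharting
}

def maximise(utilities):
    """Classic maximiser: always pick the argmax of the estimated
    utilities, inheriting any estimation error at its extreme."""
    return max(utilities, key=utilities.get)

def satisfice(utilities, threshold=0.8, rng=random):
    """Mild optimisation: pick any action whose estimated utility
    clears the threshold, rather than chasing the single highest
    estimate; fall back to the maximiser if nothing qualifies."""
    good_enough = [a for a, u in utilities.items() if u >= threshold]
    return rng.choice(good_enough) if good_enough else maximise(utilities)

print(maximise(candidate_utilities))   # always "aggressive_plan"
print(satisfice(candidate_utilities))  # "balanced_plan" or "aggressive_plan"
```

The design point the sketch tries to capture: the satisficer's output distribution puts weight on merely-good actions, so a misspecified utility function costs some value instead of steering the agent straight at the specification's weakest point. Quantilization (sampling from the top fraction of a base distribution) is a more principled variant of the same idea.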