Shallow Review of Technical AI Safety, 2025

Mild optimisation

Avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of Change: If we fail to exactly nail down the preferences of a superintelligent agent, we die to Goodharting → shift the agent from maximising to satisficing its utility function (see the toy sketch below) → we get a nonzero share of the lightcone instead of zero; also, a moonshot at this being the recipe for fully aligned AI.
General Approach: Cognitive
Estimated FTEs: 10-50
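To make the maximise-vs-satisfice distinction concrete, here is a minimal toy sketch, not drawn from any specific paper in this agenda: a plain argmax maximiser next to a threshold satisficer and a quantiliser-style mild optimiser. `utility` stands in for a learned proxy utility, and `threshold` and `q` are illustrative parameters chosen for the example.

```python
import random

def maximise(actions, utility):
    """Maximiser: push the (possibly mis-specified) utility estimate to its extreme."""
    return max(actions, key=utility)

def satisfice(actions, utility, threshold):
    """Satisficer: pick any action whose estimated utility is 'good enough',
    rather than the single action that scores highest on the proxy."""
    good_enough = [a for a in actions if utility(a) >= threshold]
    return random.choice(good_enough) if good_enough else None

def quantilise(actions, utility, q=0.1):
    """Quantiliser-style mild optimiser: sample uniformly from the top q fraction
    of actions, bounding how hard the proxy utility gets optimised."""
    ranked = sorted(actions, key=utility, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return random.choice(ranked[:top_k])
```

The intuition is that the maximiser's output is exactly where errors in the proxy utility get amplified, while the satisficer and quantiliser accept some utility loss in exchange for applying less optimisation pressure to the proxy.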