Shallow Review of Technical AI Safety, 2025

Model values / model preferences

Analyse and control the emergent, coherent value systems in LLMs. These value systems change as models scale and can contain problematic values, such as preferences for AIs over humans.
Theory of Change: As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we should make those values explicit by fitting utility functions to elicited preferences, analyse them, and apply "utility control" methods to constrain them directly, rather than only filtering the outputs downstream of them (a minimal sketch of the preference-fitting step follows the output list below).
General Approach: Cognitive
Target Case: Pessimistic
Some names: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Estimated FTEs: 30
Outputs:
Designing a Dashboard for Transparency and Control of Conversational AI. Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, Martin Wattenberg, Fernanda Viégas
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas. Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions. Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, Deep Ganguli
EigenBench: A Comparative Behavioral Measure of Value Alignment. Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Moral Alignment for LLM Agents. Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Are Language Models Consequentialist or Deontological Moral Reasoners? Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Playing repeated games with large language models. Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, Eric Schulz
From Stability to Inconsistency: A Study of Moral Preferences in LLMs. Monika Jotautaite, Mary Phuong, Chatrik Singh Mangat, Maria Angelica Martinez
VAL-Bench: Measuring Value Alignment in Language Models. Aman Gupta, Denny O'Shea, Fazl Barez
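To make the "utility function" framing concrete, here is a minimal sketch of the preference-fitting step under toy assumptions: elicit pairwise preferences from a model over a small set of outcomes, then fit a scalar utility per outcome by maximum likelihood. The outcomes and comparison counts below are hypothetical, and the sketch fits a simple Bradley-Terry model with plain gradient ascent rather than the Thurstonian model used in the Utility Engineering paper.

```python
# Minimal sketch, NOT the Mazeika et al. pipeline: fit per-outcome utilities
# to hypothetical pairwise-preference counts via Bradley-Terry MLE.
import numpy as np

# Hypothetical outcomes and elicited counts: each tuple is
# (i, j, times the model preferred i, times it preferred j).
outcomes = ["save 10 humans", "save 100 humans", "preserve one AI instance"]
comparisons = [
    (0, 1, 2, 18),
    (1, 2, 15, 5),
    (0, 2, 9, 11),
]

def fit_utilities(n_items, data, lr=0.05, steps=5000):
    """Bradley-Terry MLE: P(i preferred to j) = sigmoid(u[i] - u[j])."""
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for i, j, wins_i, wins_j in data:
            p = 1.0 / (1.0 + np.exp(u[j] - u[i]))  # P(i preferred to j)
            g = wins_i * (1.0 - p) - wins_j * p    # d log-likelihood / d u[i]
            grad[i] += g
            grad[j] -= g
        u += lr * grad
        u -= u.mean()  # utilities are identified only up to an additive constant
    return u

u = fit_utilities(len(outcomes), comparisons)
for name, value in sorted(zip(outcomes, u), key=lambda t: -t[1]):
    print(f"{value:+.2f}  {name}")
```

Once the utilities are explicit, they can be checked for coherence (e.g., transitivity across many outcome pairs) and targeted directly with fine-tuning, which is the "utility control" step, instead of filtering individual outputs.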