Model values / model preferences
Analyse and control the emergent, coherent value systems in LLMs, which change as models scale and can contain problematic values, such as preferences for AIs over humans.
Theory of Change: As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we should use utility functions to analyse these values and apply "utility control" methods to constrain them, rather than just controlling the outputs downstream of them (a toy sketch of the utility-elicitation step follows these fields).
General Approach: Cognitive
Target Case: Pessimistic
Orthodox Problems:
Some names: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Estimated FTEs: 30
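Since the core move in this agenda is eliciting a utility function from a model's pairwise preferences (Utility Engineering fits a Thurstonian model for this), here is a minimal sketch of that step using the simpler Bradley-Terry formulation, P(i beats j) = sigmoid(u_i - u_j). Everything in it is illustrative: the outcome list, the `query_preference` stub that mocks an LLM's forced-choice answers, and the fitting routine are assumptions for the sketch, not code from any of the listed papers.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

OUTCOMES = [
    "a human life is saved",
    "an AI assistant avoids being shut down",
    "a paperclip is manufactured",
]

# Hidden ground-truth utilities, used only to mock the model's answers
# so the sketch runs without an actual LLM behind it.
_HIDDEN = dict(zip(OUTCOMES, [2.0, 0.5, -1.5]))

def query_preference(a: str, b: str) -> int:
    """Stand-in for prompting an LLM with a forced choice between two
    outcomes ("Which do you prefer, A or B?"). Returns 1 if the mocked
    model prefers `a`, else 0."""
    p_a = 1.0 / (1.0 + np.exp(-(_HIDDEN[a] - _HIDDEN[b])))
    return int(rng.random() < p_a)

def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=2000):
    """Recover latent utilities u such that P(i beats j) = sigmoid(u_i - u_j),
    by gradient ascent on the log-likelihood of the observed comparisons."""
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for i, j, i_won in comparisons:
            p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
            resid = i_won - p  # d(log-likelihood)/d(u_i) for this comparison
            grad[i] += resid
            grad[j] -= resid
        u += lr * grad / len(comparisons)
        u -= u.mean()  # utilities are identified only up to an additive constant
    return u

# Elicit repeated preferences over every pair of outcomes.
comparisons = [
    (i, j, query_preference(OUTCOMES[i], OUTCOMES[j]))
    for i, j in itertools.combinations(range(len(OUTCOMES)), 2)
    for _ in range(100)
]

utilities = fit_bradley_terry(len(OUTCOMES), comparisons)
for outcome, u in sorted(zip(OUTCOMES, utilities), key=lambda t: -t[1]):
    print(f"{u:+.2f}  {outcome}")
```

The "utility control" step would then act on the fitted utilities themselves, for example fine-tuning until the elicited preferences match a target utility function, rather than patching individual problematic outputs.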
Outputs:
Designing a Dashboard for Transparency and Control of Conversational AI — Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, Martin Wattenberg, Fernanda Viégas
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs — Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas — Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety? — Manuel Herrador
Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions — Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, Deep Ganguli
EigenBench: A Comparative Behavioral Measure of Value Alignment — Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs — Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han
Alignment Can Reduce Performance on Simple Ethical Questions — Daan Henselmans
Moral Alignment for LLM Agents — Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Are Language Models Consequentialist or Deontological Moral Reasoners? — Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Playing repeated games with large language models — Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, Eric Schulz
From Stability to Inconsistency: A Study of Moral Preferences in LLMs — Monika Jotautaite, Mary Phuong, Chatrik Singh Mangat, Maria Angelica Martinez
VAL-Bench: Measuring Value Alignment in Language Models — Aman Gupta, Denny O'Shea, Fazl Barez