Iterative alignment at post-train-time
Modify weights after pre-training.
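To make "modify weights after pre-training" concrete: most of the outputs below are variants of preference-based post-training (RLHF, DPO, and relatives). Here is a minimal sketch of a DPO-style loss, assuming PyTorch and per-response log-probabilities already computed under the tuned policy and a frozen reference model; the names and numbers are illustrative, not taken from any of the cited papers.

```python
# Minimal sketch of a DPO-style preference-tuning step: one common way to
# "modify weights after pre-training". Model and data handling are omitted;
# tensors of summed per-response log-probs are assumed as inputs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs."""
    # How much more the policy prefers each response than the frozen reference does.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Increase the gap between chosen and rejected responses, scaled by beta.
    margins = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(margins).mean()

# Toy usage: random log-probs stand in for real model outputs.
batch = 4
policy_chosen = torch.randn(batch, requires_grad=True)
policy_rejected = torch.randn(batch, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected,
                torch.randn(batch), torch.randn(batch))
loss.backward()  # gradients flow only into the policy's log-probs
```

In practice the log-probs come from summing token log-likelihoods of each response under the two models; the papers below mostly vary the data, the loss, and the supervision signal rather than this basic recipe.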
Theory of Change:"LLMs don't seem very dangerous and might scale to AGI, things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, assume that zero-shot deception is hard, assume a fundamentally humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out, assume that alignment is a superficial feature, assume that tuning for what we want will also get us to avoid what we don't want. Maybe assume that thoughts are translucent."
General Approach:Engineering
Target Case:Average Case
Some names:Anca Dragan, Jacob Steinhardt, Rohin Shah
Outputs:
Composable Interventions for Language Models — Arinbjorn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives — Chloe Li, Mary Phuong, Daniel Tan
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback — Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan
Preference Learning with Lie Detectors can Induce Honesty or Evasion — Chris Cundy, Adam Gleave
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization — Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation — Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference — Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision — Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt
Consistency Training Helps Stop Sycophancy and Jailbreaks — Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective — Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi
Preference Learning for AI Alignment: a Causal Perspective — Katarzyna Kobalczyk, Mihaela van der Schaar
On Monotonicity in AI Alignment — Gilles Bareilles, Julien Fageot, Lê-Nguyên Hoang, Peva Blanchard, Wassim Bouaziz, Sébastien Rouault, El-Mahdi El-Mhamdi
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability — Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi
Uncertainty-Aware Step-wise Verification with Generative Reward Models — Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains — Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, Pang Wei Koh
Training LLMs for Honesty via Confessions — Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese