Shallow Review of Technical AI Safety, 2025

Iterative alignment at post-train-time

Modify weights after pre-training.
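For context, most of the outputs below build on or modify preference-based post-training objectives (RLHF, DPO and variants). A minimal sketch of the standard DPO loss follows, assuming paired preference data and a frozen reference model; the function and argument names are illustrative and not taken from any of the listed papers.

```python
# Minimal sketch (illustrative, not from the review): the direct preference
# optimization (DPO) loss that several of the outputs below build on or vary.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected completion under the policy being tuned or the frozen reference.
    """
    # Implicit rewards: scaled log-ratio of policy to reference for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry-style logistic loss: push the chosen completion's implicit
    # reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```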
Theory of Change:"LLMs don't seem very dangerous and might scale to AGI, things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, assume that zero-shot deception is hard, assume a fundamentally humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out, assume that alignment is a superficial feature, assume that tuning for what we want will also get us to avoid what we don't want. Maybe assume that thoughts are translucent."
General Approach:Engineering
Target Case:Average Case
Some names:Anca Dragan, Jacob Steinhardt, Rohin Shah
Outputs:
Composable Interventions for Language Models (Arinbjorn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen)
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback (Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan)
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization (Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran)
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation (Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac)
Consistency Training Helps Stop Sycophancy and Jailbreaks (Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah)
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective (Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi)
Preference Learning for AI Alignment: a Causal Perspective (Katarzyna Kobalczyk, Mihaela van der Schaar)
On Monotonicity in AI Alignment (Gilles Bareilles, Julien Fageot, Lê-Nguyên Hoang, Peva Blanchard, Wassim Bouaziz, Sébastien Rouault, El-Mahdi El-Mhamdi)
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability (Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi)
Uncertainty-Aware Step-wise Verification with Generative Reward Models (Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal)
The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains (Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, Pang Wei Koh)
Training LLMs for Honesty via Confessions (Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese)