Shallow Review of Technical AI Safety, 2025

Model specs and constitutions

Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Theory of Change:Model specs and constitutions serve three purposes. First, they provide a clear standard of behavior which can be used to train models to value what we want them to value. Second, they serve as something closer to a ground truth standard for evaluating the degree of misalignment ranging from "models straightforwardly obey the spec" to "models flagrantly disobey the spec". A combination of scalable stress-testing and reinforcement for obedience can be used to iteratively reduce the risk of misalignment. Third, they get more useful as models' instruction-following capability improves.
General Approach:Engineering
Target Case:Average Case
Some names:Amanda Askell, Joe Carlsmith
Outputs:
Deliberative Alignment: Reasoning Enables Safer Language ModelsMelody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
Stress-Testing Model Specs Reveals Character Differences among Language ModelsJifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and PreferencesMingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap
Political Neutrality in AI Is Impossible- But Here Is How to Approximate ItJillian Fisher, Ruth E. Appel, Chan Young Park, Yujin Potter, Liwei Jiang, Taylor Sorensen, Shangbin Feng, Yulia Tsvetkov, Margaret E. Roberts, Jennifer Pan, Dawn Song, Yejin Choi