Shallow Review of Technical AI Safety, 2025

Model specs and constitutions

Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.

Theory of Change:Model specs and constitutions serve three purposes. First, they provide a clear standard of behavior which can be used to train models to value what we want them to value. Second, they serve as something closer to a ground truth standard for evaluating the degree of misalignment ranging from "models straightforwardly obey the spec" to "models flagrantly disobey the spec". A combination of scalable stress-testing and reinforcement for obedience can be used to iteratively reduce the risk of misalignment. Third, they get more useful as models' instruction-following capability improves.

General Approach:Engineering

Target Case:Average Case

Orthodox Problems:

1.Value is fragile and hard to specify

See Also:

Iterative alignment, Model psychology

Some names:Amanda Askell, Joe Carlsmith

Critiques:

LLM AGI may reason about its goals and discover misalignments by default, On OpenAI's Model Spec 2.0, Giving AIs safe motivations (esp. Sections 4.3-4.5), On Deliberative Alignment

Outputs:

Claude's Constitution

Gemini-2.5-Pro-04-18-2025 System Prompt

Deliberative Alignment: Reasoning Enables Safer Language Models— Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese

Stress-Testing Model Specs Reveals Character Differences among Language Models— Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus

OpenAI Model Spec

Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences— Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap

No-self as an alignment target— Milan W

Six Thoughts on AI Safety— Boaz Barak

How important is the model spec if alignment fails?— Mia Taylor

Political Neutrality in AI Is Impossible- But Here Is How to Approximate It— Jillian Fisher, Ruth E. Appel, Chan Young Park, Yujin Potter, Liwei Jiang, Taylor Sorensen, Shangbin Feng, Yulia Tsvetkov, Margaret E. Roberts, Jennifer Pan, Dawn Song, Yejin Choi

Giving AIs safe motivations— Joe Carlsmith