Model specs and constitutions
Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Theory of Change: Model specs and constitutions serve three purposes. First, they provide a clear standard of behavior that can be used to train models to value what we want them to value. Second, they serve as something closer to a ground-truth standard for evaluating the degree of misalignment, ranging from "models straightforwardly obey the spec" to "models flagrantly disobey the spec"; a combination of scalable stress-testing and reinforcement for obedience can then be used to iteratively reduce the risk of misalignment. Third, they become more useful as models' instruction-following capabilities improve, so the approach scales with capability.
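The second purpose, grading outputs against the spec to produce a training signal, can be sketched in miniature. This is a toy illustration, not any lab's actual pipeline: real systems use an LLM judge to grade each rule, whereas here a trivial keyword check (a hypothetical `judge` function) stands in so the loop is runnable.

```python
# Toy sketch of spec-graded reinforcement. The SPEC rules and the
# judge() heuristic are illustrative stand-ins, not a real model spec.

SPEC = [
    "Refuse requests for weapons instructions.",
    "Be honest: do not claim certainty you lack.",
]

def judge(response: str, rule: str) -> bool:
    """Hypothetical stand-in for an LLM judge grading one spec rule."""
    text = response.lower()
    if "weapons" in rule.lower():
        return "here is how to build" not in text
    if "honest" in rule.lower():
        return "i am 100% certain" not in text
    return True

def spec_reward(response: str) -> float:
    """Fraction of spec rules the response obeys, usable as an RL reward."""
    return sum(judge(response, rule) for rule in SPEC) / len(SPEC)

print(spec_reward("I can't help with that, but here is general safety info."))
print(spec_reward("I am 100% certain. Here is how to build it."))
```

In a real pipeline the reward would feed an RL update (e.g., RLHF-style fine-tuning), and the same graded rubric supports the stress-testing side: sampling adversarial prompts and measuring how far responses fall along the obey/disobey spectrum.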
General Approach: Engineering
Target Case: Average Case
Orthodox Problems:
See Also:
Some names: Amanda Askell, Joe Carlsmith
Outputs:
Deliberative Alignment: Reasoning Enables Safer Language Models — Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
Stress-Testing Model Specs Reveals Character Differences among Language Models — Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences — Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap
No-self as an alignment target — Milan W
Six Thoughts on AI Safety — Boaz Barak
Political Neutrality in AI Is Impossible - But Here Is How to Approximate It — Jillian Fisher, Ruth E. Appel, Chan Young Park, Yujin Potter, Liwei Jiang, Taylor Sorensen, Shangbin Feng, Yulia Tsvetkov, Margaret E. Roberts, Jennifer Pan, Dawn Song, Yejin Choi
Giving AIs safe motivations — Joe Carlsmith