Chain of thought monitoring
Supervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states.
Theory of Change:The reasoning process (Chain of Thought, or CoT) of an AI provides a legible signal of its internal state and intentions. By monitoring this CoT, supervisors (human or AI) can detect misalignment, scheming, or reward hacking before it results in a harmful final output. This allows for more robust oversight than supervising outputs alone, but it relies on the CoT remaining faithful (i.e., accurately reflecting the model's reasoning) and not becoming obfuscated under optimization pressure.
General Approach:Engineering
Target Case:Average Case
Some names:Bowen Baker, Joost Huizinga, Leo Gao, Scott Emmons, Erik Jenner, Yanda Chen, James Chua, Owain Evans, Tomek Korbak, Mikita Balesni, Xinpeng Wang, Miles Turpin, Rohin Shah
Estimated FTEs:10-100
Outputs:
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation— Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi
Detecting misbehavior in frontier reasoning models— Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors— Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
Reasoning Models Don't Always Say What They Think— Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort— Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring— Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability— Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner
Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning— Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Are DeepSeek R1 And Other Reasoning Models More Faithful?— James Chua, Owain Evans
A Pragmatic Way to Measure Chain-of-Thought Monitorability— Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety— Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
CoT May Be Highly Informative Despite "Unfaithfulness"— Amy Deng, Sydney Von Arx, Ben Snodin, Sudarsh Kunnavakkam, Tamera Lanham
Aether July 2025 Update— Rohan Subramani, Rauno Arike, Shubhorup Biswas