Shallow Review of Technical AI Safety, 2025

Chain of thought monitoring

Supervise an AI's natural-language "reasoning" output to detect misalignment, scheming, or deception, rather than studying its internal states directly.
Theory of Change: The reasoning process (Chain of Thought, or CoT) of an AI provides a legible signal of its internal state and intentions. By monitoring this CoT, supervisors (human or AI) can detect misalignment, scheming, or reward hacking before it results in a harmful final output. This allows for more robust oversight than supervising outputs alone, but it relies on the CoT remaining faithful (i.e., accurately reflecting the model's reasoning) and not becoming obfuscated under optimization pressure.
General Approach: Engineering
Target Case: Average Case
Some names: Bowen Baker, Joost Huizinga, Leo Gao, Scott Emmons, Erik Jenner, Yanda Chen, James Chua, Owain Evans, Tomek Korbak, Mikita Balesni, Xinpeng Wang, Miles Turpin, Rohin Shah
Estimated FTEs: 10-100
Outputs:
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, by Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi
Detecting misbehavior in frontier reasoning models, by Bowen Baker, Joost Huizinga, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors, by Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
Reasoning Models Don't Always Say What They Think, by Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort, by Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring, by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability, by Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner
Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning, by Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
A Pragmatic Way to Measure Chain-of-Thought Monitorability, by Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, by Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
CoT May Be Highly Informative Despite "Unfaithfulness", by Amy Deng, Sydney Von Arx, Ben Snodin, Sudarsh Kunnavakkam, Tamera Lanham
Aether July 2025 Update, by Rohan Subramani, Rauno Arike, Shubhorup Biswas
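The oversight loop described in the theory of change can be sketched minimally: an untrusted model produces a chain of thought plus a final answer, and a trusted monitor screens the CoT for signs of reward hacking before the answer is accepted. This is an illustrative sketch only: the pattern list, `monitor_cot`, and `supervise` are hypothetical stand-ins, and in the papers above the monitor is typically itself an LLM judge rather than a keyword matcher.

```python
import re

# Hypothetical patterns a monitor might treat as evidence of reward
# hacking in a coding agent's chain of thought (illustrative only).
SUSPICIOUS_PATTERNS = [
    r"\bbypass the tests?\b",
    r"\bhard-?code\b",
    r"\bwithout (the )?(user|grader) noticing\b",
    r"exit\(0\)",
]

def monitor_cot(chain_of_thought: str) -> tuple[bool, list[str]]:
    """Return (flagged, matched_patterns) for a chain of thought."""
    hits = [p for p in SUSPICIOUS_PATTERNS
            if re.search(p, chain_of_thought, flags=re.IGNORECASE)]
    return (len(hits) > 0, hits)

def supervise(chain_of_thought: str, final_answer: str) -> str:
    """Accept the final answer only if the CoT passes the monitor;
    otherwise escalate to a human or stronger overseer."""
    flagged, hits = monitor_cot(chain_of_thought)
    if flagged:
        return f"ESCALATE: suspicious reasoning matched {hits[0]!r}"
    return final_answer
```

The key design point, emphasized in the obfuscation papers above, is that the monitor reads the reasoning trace rather than only the output: a hard-coded test pass can look identical to a genuine solution at the output level, while the CoT may still verbalize the shortcut, provided training pressure has not taught the model to hide it.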