Shallow Review of Technical AI Safety, 2025

Extracting latent knowledge

Identifying and decoding the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
Theory of Change: Powerful models may know things they do not say (e.g. that they are currently being tested). If we can read this latent knowledge directly off the model's internals, we can supervise models reliably even when they attempt to deceive human evaluators or when the task is too difficult for humans to verify directly. (Two illustrative probe sketches follow the outputs list below.)
General Approach: Cognitive
Target Case: Worst Case
Some names: Jacob Steinhardt
Estimated FTEs: 20-40
Outputs:
Eliciting Secret Knowledge from Language Models (Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks)
Here's 18 Applications of Deception Probes (Cleo Nardo, Avi Parrack, jordine)
Towards eliciting latent knowledge from LLMs with mechanistic interpretability (Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda)
CCS-Lib: A Python package to elicit latent knowledge from LLMs (Walter Laurito, Nora Belrose, Alex Mallen, Kay Kozaronek, Fabien Roger, Christy Koh, James Chua, Jonathan Ng, Alexander Wan, Reagan Lee, Ben W., Kyle O'Brien, Augustas Macijauskas, Eric Mungai Kinuthia, Marius Pl, Waree Sethapun, Kaarel Hänni). The CCS objective this library builds on is sketched after this list.
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes (Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi)
Caught in the Act: a mechanistic approach to detecting deception (Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval)
When Truthful Representations Flip Under Deceptive Instructions? (Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li)
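
Several of the outputs above share a common template: fit a linear probe on a model's hidden activations and read a truth or deception signal off the learned direction. Below is a minimal sketch of a supervised deception probe, assuming access to activations labeled honest vs. deceptive; the synthetic data, shapes, and variable names are illustrative assumptions, not taken from any of the cited papers.

```python
# Sketch: supervised "deception probe" on hidden activations.
# Synthetic data stands in for real model runs (assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed setup: d-dimensional residual-stream activations from one layer,
# labeled 1 if the model was instructed to deceive, 0 if answering honestly.
n, d = 512, 64
honest = rng.normal(size=(n, d))
deceptive = rng.normal(size=(n, d))
deceptive[:, 0] += 2.0  # pretend deception shifts activations along one axis

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# The normalized weight vector is a candidate "deception direction" that can
# be scored against activations from prompts the probe never saw.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```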
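
CCS-Lib descends from Contrast-Consistent Search (CCS; Burns et al., 2022), which finds a truth direction without labels by exploiting the logical consistency between a statement and its negation. Here is a minimal sketch of the CCS objective, assuming precomputed, mean-normalized contrast-pair activations; the random tensors are placeholders for real activations.

```python
# Sketch: the CCS loss from Burns et al. (2022).
# Only the objective mirrors the paper; data and shapes are assumptions.
import torch

torch.manual_seed(0)
n, d = 256, 64

# Assumed inputs: activations for the same statement phrased as true (x_pos)
# and as false (x_neg), each mean-normalized per class.
x_pos = torch.randn(n, d)
x_neg = torch.randn(n, d)

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(200):
    p_pos = probe(x_pos).squeeze(-1)
    p_neg = probe(x_neg).squeeze(-1)
    # Consistency: a statement and its negation should receive complementary
    # probabilities, i.e. p(x+) close to 1 - p(x-).
    l_consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate fixed point p = 0.5 everywhere.
    l_confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = l_consistency + l_confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the objective is symmetric between truth and falsehood, the sign of the learned direction is ambiguous and must be resolved after training, e.g. with a handful of labeled examples.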