Extracting latent knowledge
Identifying and decoding the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
Theory of Change: Powerful models may know things they do not say (e.g., that they are currently being tested). If we can read this latent knowledge directly out of a model's internals, we can supervise models reliably even when they attempt to deceive human evaluators, or when the task is too difficult for humans to verify directly.
General Approach: Cognitive
Target Case: Worst Case
Orthodox Problems:
Some names: Jacob Steinhardt
Estimated FTEs: 20-40
Outputs:
Eliciting Secret Knowledge from Language Models — Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
Here's 18 Applications of Deception Probes — Cleo Nardo, Avi Parrack, jordine
Towards eliciting latent knowledge from LLMs with mechanistic interpretability — Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
CCS-Lib: A Python package to elicit latent knowledge from LLMs — Walter Laurito, Nora Belrose, Alex Mallen, Kay Kozaronek, Fabien Roger, Christy Koh, James Chua, Jonathan Ng, Alexander Wan, Reagan Lee, Ben W., Kyle O'Brien, Augustas Macijauskas, Eric Mungai Kinuthia, Marius Pl, Waree Sethapun, Kaarel Hänni
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes — Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models — Kai Wang, Yihao Zhang, Meng Sun
Caught in the Act: a mechanistic approach to detecting deception — Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval
When Truthful Representations Flip Under Deceptive Instructions? — Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li
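Of the outputs above, CCS-Lib packages contrast-consistent search (CCS; Burns et al., 2022), which is representative of how latent knowledge is read out in practice: learn a linear probe over hidden activations whose answers on a statement and its negation are mutually consistent, without any labels saying which statements are true. Below is a minimal sketch of the CCS objective. It runs on synthetic activations standing in for real hidden states, and the dimensions, noise model, and training settings are illustrative assumptions, not CCS-Lib's actual defaults.

```python
# Minimal sketch of Contrast-Consistent Search (CCS; Burns et al. 2022),
# the unsupervised probing method implemented by the cited CCS-Lib.
# The activations are synthetic stand-ins for a model's hidden states on
# contrast pairs ("<statement> True" vs. "<statement> False").
import torch

torch.manual_seed(0)
n, d = 256, 64  # number of contrast pairs, hidden-state dimension

# Toy hidden states: a shared "truth direction" that appears in the
# "True" completion iff the statement is true, plus Gaussian noise.
truth_dir = torch.randn(d)
labels = torch.randint(0, 2, (n,)).float()  # never shown to the probe
x_pos = labels[:, None] * truth_dir + 0.5 * torch.randn(n, d)
x_neg = (1 - labels)[:, None] * truth_dir + 0.5 * torch.randn(n, d)

# Normalize each side separately to remove the prompt-format direction.
x_pos = (x_pos - x_pos.mean(0)) / x_pos.std(0)
x_neg = (x_neg - x_neg.mean(0)) / x_neg.std(0)

probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(500):
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)
    # Consistency: P("True") and P("False") should sum to one.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate p = 0.5 everywhere solution.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# CCS recovers the truth direction only up to sign, so score both flips.
pred = (torch.sigmoid(probe(x_pos)).squeeze(-1) > 0.5).float()
acc = max((pred == labels).float().mean().item(),
          ((1 - pred) == labels).float().mean().item())
print(f"probe accuracy (up to sign): {acc:.2f}")
```

In a real pipeline the paired activations would be extracted from a language model at some intermediate layer; the supervised deception probes in several of the other outputs above replace the unsupervised consistency objective with ordinary logistic regression on labeled honest vs. deceptive examples.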