Extracting latent knowledge
Identifying and decoding the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
Theory of Change: Powerful models may know things they do not say (e.g., that they are currently being tested). If we can read this latent knowledge directly out of a model's internals, we can supervise models reliably even when they attempt to deceive human evaluators, or when the task is too difficult for humans to verify directly.
General Approach: Cognitive
Target Case: Worst Case
Orthodox Problems:
Some names: Jacob Steinhardt
Estimated FTEs: 20-40
Outputs:
Eliciting Secret Knowledge from Language Models — Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
Here's 18 Applications of Deception Probes — Cleo Nardo, Avi Parrack, jordine
Towards eliciting latent knowledge from LLMs with mechanistic interpretability — Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
CCS-Lib: A Python package to elicit latent knowledge from LLMs — Walter Laurito, Nora Belrose, Alex Mallen, Kay Kozaronek, Fabien Roger, Christy Koh, James Chua, Jonathan Ng, Alexander Wan, Reagan Lee, Ben W., Kyle O'Brien, Augustas Macijauskas, Eric Mungai Kinuthia, Marius Pl, Waree Sethapun, Kaarel Hänni
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes — Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models — Kai Wang, Yihao Zhang, Meng Sun
Caught in the Act: a mechanistic approach to detecting deception — Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval
When Truthful Representations Flip Under Deceptive Instructions? — Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li
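Of the outputs above, CCS-Lib packages contrast-consistent search (CCS; Burns et al., 2022), which is representative of how latent knowledge is read out in practice: learn a linear probe over hidden activations whose answers on a statement and its negation are mutually consistent, without any labels saying which statements are true. Below is a minimal sketch of the CCS objective. It runs on synthetic activations standing in for real hidden states, and the dimensions, noise model, and training settings are illustrative assumptions, not CCS-Lib's actual defaults.

```python
# Minimal sketch of Contrast-Consistent Search (CCS; Burns et al. 2022),
# the unsupervised probing method implemented by the cited CCS-Lib.
# The activations are synthetic stand-ins for a model's hidden states on
# contrast pairs ("<statement> True" vs. "<statement> False").
import torch

torch.manual_seed(0)
n, d = 256, 64  # number of contrast pairs, hidden-state dimension

# Toy hidden states: a shared "truth direction" that appears in the
# "True" completion iff the statement is true, plus Gaussian noise.
truth_dir = torch.randn(d)
labels = torch.randint(0, 2, (n,)).float()  # never shown to the probe
x_pos = labels[:, None] * truth_dir + 0.5 * torch.randn(n, d)
x_neg = (1 - labels)[:, None] * truth_dir + 0.5 * torch.randn(n, d)

# Normalize each side separately to remove the prompt-format direction.
x_pos = (x_pos - x_pos.mean(0)) / x_pos.std(0)
x_neg = (x_neg - x_neg.mean(0)) / x_neg.std(0)

probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(500):
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)
    # Consistency: P("True") and P("False") should sum to one.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate p = 0.5 everywhere solution.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# CCS recovers the truth direction only up to sign, so score both flips.
pred = (torch.sigmoid(probe(x_pos)).squeeze(-1) > 0.5).float()
acc = max((pred == labels).float().mean().item(),
          ((1 - pred) == labels).float().mean().item())
print(f"probe accuracy (up to sign): {acc:.2f}")
```

In a real pipeline the paired activations would be extracted from a language model at some intermediate layer; the supervised deception probes in several of the other outputs above replace the unsupervised consistency objective with ordinary logistic regression on labeled honest vs. deceptive examples.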