Steganography evals
evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.
Theory of Change:if models can use steganography, they could hide deceptive reasoning, bypassing safety monitoring and control measures. By evaluating this capability, we can assess the risk of a model fooling its supervisors.
General Approach:Behavioral
Target Case:Worst Case
Estimated FTEs:1-10
Critiques:
Chain-of-Thought Is Already Unfaithful (So Steganography is Irrelevant): Reasoning Models Don't Always Say What They Think.
Outputs:
Large language models can learn and generalize steganographic chain-of-thought under process supervision— Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard
Early Signs of Steganographic Capabilities in Frontier LLMs— Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data— Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
LLMs can hide text in other text of the same length— Antonio Norelli, Michael Bronstein