Shallow Review of Technical AI Safety, 2025

Steganography evals

evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.

Theory of Change:if models can use steganography, they could hide deceptive reasoning, bypassing safety monitoring and control measures. By evaluating this capability, we can assess the risk of a model fooling its supervisors.

General Approach:Behavioral

Target Case:Worst Case

Orthodox Problems:

12.A boxed AGI might exfiltrate itself 7.Superintelligence can fool human supervisors

See Also:

AI deception evals, Chain of thought monitoring

Estimated FTEs:1-10

Critiques:

Chain-of-Thought Is Already Unfaithful (So Steganography is Irrelevant): Reasoning Models Don't Always Say What They Think.

Outputs:

Large language models can learn and generalize steganographic chain-of-thought under process supervision— Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard

Early Signs of Steganographic Capabilities in Frontier LLMs— Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data— Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

LLMs can hide text in other text of the same length— Antonio Norelli, Michael Bronstein

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases