Shallow Review of Technical AI Safety, 2025

Model psychopathology

Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.

Theory of Change: The study of 'pathological' phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training dynamics (compare: the study of aphasia and visual processing disorders in humans plays a key role in cognitive science), and in particular for developing a good theory of generalization in LLMs.

Target Case: Pessimistic

Orthodox Problems:

4. Goals misgeneralize out of distribution

See Also:

Emergent misalignment, mechanistic anomaly detection

Some names: Truthful AI, Theia Vogel, Stewart Slocum, Nell Watson, Samuel G. B. Johnson, Liwei Jiang, Monika Jotautaite, Saloni Dash

Estimated FTEs: 5-20

Outputs:

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data - Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

LLMs Can Get "Brain Rot"! - Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang

Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning - Saloni Dash, Amélie Reymond, Emma S. Spiro, Aylin Caliskan

Unified Multimodal Models Cannot Describe Images From Memory - Michael Aerni, Joshua Swanson, Kristina Nikolić, Florian Tramèr

Believe It or Not: How Deeply do LLMs Believe Implanted Facts? - Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang

Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence - Nell Watson, Ali Hessami

Imagining and building wise machines: The centrality of AI metacognition - Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) - Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi

Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions - Yuyang Jiang, Longjie Guo, Yuchen Wu, Aylin Caliskan, Tanu Mitra, Hua Shen